Prediction on Diabetes Patient's Hospital Readmission¶
- Problem Statement and Objective
A hospital readmission is when a patient who is discharged from the hospital, gets re-admitted again within a certain period of time. Hospital readmission rates for certain conditions are now considered an indicator of hospital quality, and also affect the cost of care adversely. For this reason, Centers for Medicare & Medicaid Services established the Hospital Readmissions Reduction Program which aims to improve quality of care for patients and reduce health care spending by applying payment penalties to hospitals that have more than expected readmission rates for certain conditions. Although diabetes is not yet included in the penalty measures, the program is regularly adding new disease conditions to the list, now totaling 6 for FY2018. In 2011, American hospitals spent over $41 billion on diabetic patients who got readmitted within 30 days of discharge. Being able to determine factors that lead to higher readmission in such patients, and correspondingly being able to predict which patients will get readmitted can help hospitals save millions of dollars while improving quality of care. So, with that background in mind, we used a medical claims dataset (description below), to answer these questions:
Problem statment¶
The goal is to predict the chances of readmission of patients to the hospital considering certain changes in medication procedures, length of stay at the hospital, HbA1c measurement, age group, and multiple factors.
Step 1 - Data Wrangling¶
(a) - Gathering data¶
Data Sets:
• The project utilizes the "Diabetes 130-US hospitals for years 1999-2008 Data Set" from the UCI Machine Learning Repository. This dataset contains a wealth of information about diabetic patients, including demographics, diagnoses, medications, and outcomes, making it well-suited for predictive modeling.
Data Sources:
• The dataset is sourced from the UCI Machine Learning Repository, a reputable source of publicly available datasets for research and educational purposes. Access to this dataset ensures compliance with data privacy regulations and ethical considerations, providing a reliable foundation for the project's analysis and modeling efforts.
Import Libraries¶
pip install plotly --upgrade
Requirement already satisfied: plotly in c:\users\mahen\anaconda3\anaconda\lib\site-packages (5.23.0) Requirement already satisfied: tenacity>=6.2.0 in c:\users\mahen\anaconda3\anaconda\lib\site-packages (from plotly) (8.2.2) Requirement already satisfied: packaging in c:\users\mahen\anaconda3\anaconda\lib\site-packages (from plotly) (23.2) Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("diabetic_data.csv")
df.head() # Have a look to the data
| encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |
5 rows × 50 columns
df.columns
Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
'time_in_hospital', 'payer_code', 'medical_specialty',
'num_lab_procedures', 'num_procedures', 'num_medications',
'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
'tolazamide', 'examide', 'citoglipton', 'insulin',
'glyburide-metformin', 'glipizide-metformin',
'glimepiride-pioglitazone', 'metformin-rosiglitazone',
'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
dtype='object')
df.shape
(101766, 50)
(b) - Assessing Data¶
summary of data¶
- Introduction
The Diabetes 130-US Hospitals Dataset is a comprehensive dataset sourced from the UCI Machine Learning Repository, encompassing data collected over a decade (1999-2008) from 130 hospitals across the United States. The dataset contains valuable information on diabetic patients' healthcare, including demographics, diagnoses, treatments, and outcomes. With 101,766 patient records and 50 attributes, this dataset is an extensive resource for predictive modeling, particularly in predicting hospital readmissions among diabetic patients.
- Dataset Overview
The dataset consists of 101,766 rows (patient encounters) and 50 columns (features). Each row represents a unique patient encounter at a hospital, and the columns provide detailed information about the patient's demographics, medical conditions, hospital procedures, medications, and outcomes. The dataset is particularly rich in information that can be leveraged to predict whether a patient is likely to be readmitted to the hospital after discharge.
- Key Attributes:
Patient Demographics: Includes attributes like race, gender, age, and weight.
Hospital Admission Details: Captured through attributes such as admission_type_id, discharge_disposition_id, and admission_source_id.
Clinical Measurements: Features like time_in_hospital, num_lab_procedures, num_procedures, and num_medications reflect the intensity of care.
Medical Diagnoses: diag_1, diag_2, and diag_3 provide diagnostic codes for primary, secondary, and tertiary diagnoses.
Medications: Information on specific diabetes medications administered, such as metformin, insulin, and others.
Outcome: The readmitted column indicates whether the patient was readmitted to the hospital, which serves as the target variable for prediction.
Column descriptions¶
Encounter ID - Unique identifier of an encounter
Patient number -Unique identifier of a patient
Race Values - Caucasian,Asian,African,Hispanic and other
Gender Values - Male,female and unknown/invalid
Age - Grouped in 10-year interval: 0,10), 10,20),......,90,100)
Weight - weight in pound
Admission - Type integer identifier corresponding to 9 distinct values,for example,emergency,urgent,elective,newborn and not available
Discharge disposition - Integer identifier corresponding to 29 distinct values. for example ,discharge to home,expired,and not availabe
Admission source - Integer identifier corresponding to 21 distint values.for example, physician referral,emergency room and transfer from a hospital
Time in Hospital - Integer number of days between admission and discharge
Payer code - Integer identifier corresponding to 23 distint values,for example,Blue Cross/Blue Shield,Medicare and self-pay medical
Medical specialty - Integer identifier of a specialty of the admitting physician,corresponding to 84 distint values. for example cardiology,internal medicine,family/general practice and surgeon
Number of lab procedures - Number of lab tests performed during the encounter
Number of procedures - Numeric number of procedures(other than lab tests) performed during the encounter
Number of Medications - Numeric of distinct generic names administered during the encounter
Number of outpatient - visit number of outpatient visit of the patient in the year preceding the encounter
Number of Emergency - visit number of Emergency visit of the patient in the year preceding the encounter
Number of inpatient - visit number of inpatient visit of the patient in the year preceding the encounter
Diagnosis 1 - The primary diagnosis(coded as first three digits of ICD9): 848 Distinct values
Diagnosis 2 - The Secondary diagnosis(coded as first three digits of ICD9): 923 Distinct values
Diagnosis 3 - Additional secondary diagnosis(coded as first three digits of ICD9): 954 Distinct values
Number of diagnoses - Number of diagnoses entered to the system 0%
Glucose serum test result - Indicates the range of the result or if the test was not taken.values:">200",">300","normal" and "none"if not measured
A1c test result - indicates the range of the result or if the test was not taken.Values: ">8" if the result was greater than 8%,">7" if the result was greater than 7% but less than 8%,"normal" if the result was less than 7%,and "none" if not measures.
Change of medications - indicates if there was a change in diabetic medication (either doses or generic name),Values:"change" and "no change"
Diabetes medication - Indicates if there was any diabetic medication prescribed.values: "yes" and "no"
24 features - For medications for the generic name: metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin, glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone and metformin-pioglitazone.The feature indicates whether the drug was prescribed or there was a change in the dosage. values "up" if the dosage was increased during the encounter,"down" if the dosage was decreased,"steady" if the dosage did not change, and "no" if the drug was not prescribed
Readmitted - Days to inpatient readmission,values: "<30" if the patient was readmitted in less than 30 days,">30" if the patient was readmitted in more than 30 days, and "No" for no record of readmission
Variables¶
- Patient identifiers - encounter_id,patient_nbr
- Patient demographics - race,gender,age,weight,payer_code
- Admission and Discharge details - admission_source_id,admission_type_id,discharge_disposition_id
- Patient medical history - number_outpatient,number_inpatient,number_emergency
- Patient Admission details - medical_speacialty,diag_1,diag_2,diag_3,time_in_hospital,number_diagnoses,num_lab_procedures,num_procedures,num_medications
- Clinical Result - max_glu_serum,A1cresult
- Medication Details - diabetesMed,change,23 feature for medications
- Readmission indicator - readmitted
ICD¶
- ICD-9 stands for the International Classification of Diseases, 9th Revision. It's a coding system used to classify diseases and other health problems for the purpose of billing, statistical analysis, and medical research. The ICD-9 system was developed by the World Health Organization (WHO) and it was used widely until it was replaced by ICD-10.
Max_Glu_Serum (Maximum Glucose Serum):¶
- This test measures the concentration of glucose (sugar) in the blood serum. It's commonly used to diagnose and monitor diabetes mellitus. Elevated levels of glucose in the blood can indicate diabetes or other metabolic disorders. The test is typically done after fasting for a certain period (Fasting Blood Glucose Test) or randomly (Random Blood Glucose Test).
A1C Result (Hemoglobin A1C):¶
This test measures the average blood glucose levels over the past two to three months. It provides an indication of long-term blood sugar control. The A1C test measures the percentage of hemoglobin (a protein in red blood cells) that has glucose attached to it. It's commonly used to monitor the effectiveness of diabetes treatment and to assess the risk of diabetes-related complications. A lower A1C level indicates better blood sugar control.
Both of these tests are important tools in the diagnosis and management of diabetes mellitus. They help healthcare providers assess blood sugar levels over different time frames and make informed decisions about treatment and lifestyle modifications.
df.tail()
| encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101761 | 443847548 | 100162476 | AfricanAmerican | Male | [70-80) | ? | 1 | 3 | 7 | 3 | ... | No | Down | No | No | No | No | No | Ch | Yes | >30 |
| 101762 | 443847782 | 74694222 | AfricanAmerican | Female | [80-90) | ? | 1 | 4 | 5 | 5 | ... | No | Steady | No | No | No | No | No | No | Yes | NO |
| 101763 | 443854148 | 41088789 | Caucasian | Male | [70-80) | ? | 1 | 1 | 7 | 1 | ... | No | Down | No | No | No | No | No | Ch | Yes | NO |
| 101764 | 443857166 | 31693671 | Caucasian | Female | [80-90) | ? | 2 | 3 | 7 | 10 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 101765 | 443867222 | 175429310 | Caucasian | Male | [70-80) | ? | 1 | 1 | 7 | 6 | ... | No | No | No | No | No | No | No | No | No | NO |
5 rows × 50 columns
df.sample(5)
| encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20228 | 71473086 | 23488398 | Caucasian | Male | [50-60) | ? | 5 | 1 | 17 | 1 | ... | No | Steady | No | No | No | No | No | No | Yes | NO |
| 15448 | 59283900 | 16150374 | Caucasian | Male | [40-50) | ? | 3 | 18 | 1 | 2 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |
| 5848 | 29848692 | 3031074 | Caucasian | Female | [60-70) | ? | 2 | 1 | 1 | 11 | ... | No | No | No | No | No | No | No | Ch | Yes | NO |
| 88891 | 286527180 | 38637351 | AfricanAmerican | Female | [70-80) | ? | 3 | 1 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 29594 | 96544062 | 61080255 | Caucasian | Male | [70-80) | ? | 1 | 1 | 7 | 4 | ... | No | Steady | No | No | No | No | No | No | Yes | <30 |
5 rows × 50 columns
# Checking Dimension of the Data
df.shape
(101766, 50)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 101766 entries, 0 to 101765 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 encounter_id 101766 non-null int64 1 patient_nbr 101766 non-null int64 2 race 101766 non-null object 3 gender 101766 non-null object 4 age 101766 non-null object 5 weight 101766 non-null object 6 admission_type_id 101766 non-null int64 7 discharge_disposition_id 101766 non-null int64 8 admission_source_id 101766 non-null int64 9 time_in_hospital 101766 non-null int64 10 payer_code 101766 non-null object 11 medical_specialty 101766 non-null object 12 num_lab_procedures 101766 non-null int64 13 num_procedures 101766 non-null int64 14 num_medications 101766 non-null int64 15 number_outpatient 101766 non-null int64 16 number_emergency 101766 non-null int64 17 number_inpatient 101766 non-null int64 18 diag_1 101766 non-null object 19 diag_2 101766 non-null object 20 diag_3 101766 non-null object 21 number_diagnoses 101766 non-null int64 22 max_glu_serum 5346 non-null object 23 A1Cresult 17018 non-null object 24 metformin 101766 non-null object 25 repaglinide 101766 non-null object 26 nateglinide 101766 non-null object 27 chlorpropamide 101766 non-null object 28 glimepiride 101766 non-null object 29 acetohexamide 101766 non-null object 30 glipizide 101766 non-null object 31 glyburide 101766 non-null object 32 tolbutamide 101766 non-null object 33 pioglitazone 101766 non-null object 34 rosiglitazone 101766 non-null object 35 acarbose 101766 non-null object 36 miglitol 101766 non-null object 37 troglitazone 101766 non-null object 38 tolazamide 101766 non-null object 39 examide 101766 non-null object 40 citoglipton 101766 non-null object 41 insulin 101766 non-null object 42 glyburide-metformin 101766 non-null object 43 glipizide-metformin 101766 non-null object 44 glimepiride-pioglitazone 101766 non-null object 45 metformin-rosiglitazone 101766 non-null object 46 metformin-pioglitazone 101766 non-null object 47 change 101766 non-null object 48 diabetesMed 101766 non-null object 49 readmitted 101766 non-null object dtypes: int64(13), object(37) memory usage: 38.8+ MB
- There are 101766 samples present in this dataset
print("According to the data, there are a total of 50 columns")
# Select categorical columns
df_cat = df.select_dtypes(include='object')
# Select numerical columns
df_num = df.select_dtypes(include=np.number)
# Print the number and names of categorical columns
print('Total categorical columns: ', len(df_cat.columns))
print('\nCategorical column names: ', df_cat.columns.tolist())
print('\n')
# Print the number and names of numerical columns
print('Total numerical columns: ', len(df_num.columns))
print('\nNumerical column names: ', df_num.columns.tolist())
According to the data, there are a total of 50 columns Total categorical columns: 37 Categorical column names: ['race', 'gender', 'age', 'weight', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'] Total numerical columns: 13 Numerical column names: ['encounter_id', 'patient_nbr', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses']
# Checking whether our data contains any missing value or not
df.isnull().sum()
encounter_id 0 patient_nbr 0 race 0 gender 0 age 0 weight 0 admission_type_id 0 discharge_disposition_id 0 admission_source_id 0 time_in_hospital 0 payer_code 0 medical_specialty 0 num_lab_procedures 0 num_procedures 0 num_medications 0 number_outpatient 0 number_emergency 0 number_inpatient 0 diag_1 0 diag_2 0 diag_3 0 number_diagnoses 0 max_glu_serum 96420 A1Cresult 84748 metformin 0 repaglinide 0 nateglinide 0 chlorpropamide 0 glimepiride 0 acetohexamide 0 glipizide 0 glyburide 0 tolbutamide 0 pioglitazone 0 rosiglitazone 0 acarbose 0 miglitol 0 troglitazone 0 tolazamide 0 examide 0 citoglipton 0 insulin 0 glyburide-metformin 0 glipizide-metformin 0 glimepiride-pioglitazone 0 metformin-rosiglitazone 0 metformin-pioglitazone 0 change 0 diabetesMed 0 readmitted 0 dtype: int64
- There is no null values
# check if there are any special characters in the data or not
for i in df.columns:
if df[i].dtype == object:
print(i,df[i][df[i] == '?'].count())
race 2273 gender 0 age 0 weight 98569 payer_code 40256 medical_specialty 49949 diag_1 21 diag_2 358 diag_3 1423 max_glu_serum 0 A1Cresult 0 metformin 0 repaglinide 0 nateglinide 0 chlorpropamide 0 glimepiride 0 acetohexamide 0 glipizide 0 glyburide 0 tolbutamide 0 pioglitazone 0 rosiglitazone 0 acarbose 0 miglitol 0 troglitazone 0 tolazamide 0 examide 0 citoglipton 0 insulin 0 glyburide-metformin 0 glipizide-metformin 0 glimepiride-pioglitazone 0 metformin-rosiglitazone 0 metformin-pioglitazone 0 change 0 diabetesMed 0 readmitted 0
for i in df.columns:
print(i,df[i][df[i]=='?'].count()/len(df)*100) # checking in percentage
encounter_id 0.0 patient_nbr 0.0 race 2.2335554114340743 gender 0.0 age 0.0 weight 96.85847925633315 admission_type_id 0.0 discharge_disposition_id 0.0 admission_source_id 0.0 time_in_hospital 0.0 payer_code 39.5574160328597 medical_specialty 49.08220820313268 num_lab_procedures 0.0 num_procedures 0.0 num_medications 0.0 number_outpatient 0.0 number_emergency 0.0 number_inpatient 0.0 diag_1 0.02063557573256294 diag_2 0.3517874339170253 diag_3 1.398305917497003 number_diagnoses 0.0 max_glu_serum 0.0 A1Cresult 0.0 metformin 0.0 repaglinide 0.0 nateglinide 0.0 chlorpropamide 0.0 glimepiride 0.0 acetohexamide 0.0 glipizide 0.0 glyburide 0.0 tolbutamide 0.0 pioglitazone 0.0 rosiglitazone 0.0 acarbose 0.0 miglitol 0.0 troglitazone 0.0 tolazamide 0.0 examide 0.0 citoglipton 0.0 insulin 0.0 glyburide-metformin 0.0 glipizide-metformin 0.0 glimepiride-pioglitazone 0.0 metformin-rosiglitazone 0.0 metformin-pioglitazone 0.0 change 0.0 diabetesMed 0.0 readmitted 0.0
# Displaying the statistical overview of the data
df.describe()
| encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.017660e+05 | 1.017660e+05 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 |
| mean | 1.652016e+08 | 5.433040e+07 | 2.024006 | 3.715642 | 5.754437 | 4.395987 | 43.095641 | 1.339730 | 16.021844 | 0.369357 | 0.197836 | 0.635566 | 7.422607 |
| std | 1.026403e+08 | 3.869636e+07 | 1.445403 | 5.280166 | 4.064081 | 2.985108 | 19.674362 | 1.705807 | 8.127566 | 1.267265 | 0.930472 | 1.262863 | 1.933600 |
| min | 1.252200e+04 | 1.350000e+02 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 8.496119e+07 | 2.341322e+07 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 31.000000 | 0.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 |
| 50% | 1.523890e+08 | 4.550514e+07 | 1.000000 | 1.000000 | 7.000000 | 4.000000 | 44.000000 | 1.000000 | 15.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 |
| 75% | 2.302709e+08 | 8.754595e+07 | 3.000000 | 4.000000 | 7.000000 | 6.000000 | 57.000000 | 2.000000 | 20.000000 | 0.000000 | 0.000000 | 1.000000 | 9.000000 |
| max | 4.438672e+08 | 1.895026e+08 | 8.000000 | 28.000000 | 25.000000 | 14.000000 | 132.000000 | 6.000000 | 81.000000 | 42.000000 | 76.000000 | 21.000000 | 16.000000 |
df.duplicated().sum()
0
There are total 50 features looking into all of these:¶¶
#1.encounter_id
df['encounter_id'].nunique()
101766
has_negative = (df['encounter_id'] < 0).any()
print(has_negative)
False
df['encounter_id'].dtypes
dtype('int64')
df['encounter_id'].isnull().sum()
0
Patients that have come to the hospital have been provided with different unique encounter_id .so that's why there are 101766 values.
Nothing problem with this column in the point of data clening
#2. patient_nbr
df['patient_nbr'].nunique()
71518
- There are 71518 unique paitents that have entered the system
- encounter_id and patient_nbr these are just identifiers and not useful variables so they can be dropped from our future analysis
# 3 race
df['race'].value_counts()
race Caucasian 76099 AfricanAmerican 19210 ? 2273 Hispanic 2037 Other 1506 Asian 641 Name: count, dtype: int64
df['race'].isnull().sum()
0
df['race'].dtypes
dtype('O')
- In 'race' column there are 2273 which is 2.23% value filled with '?'
#4.Gender
df['gender'].value_counts()
gender Female 54708 Male 47055 Unknown/Invalid 3 Name: count, dtype: int64
df['gender'].dtypes
dtype('O')
df['gender'].isnull().sum()
0
- In this column there are 3 value filled with 'Unknown/Invalid'
#5.age
df['age'].value_counts()
age [70-80) 26068 [60-70) 22483 [50-60) 17256 [80-90) 17197 [40-50) 9685 [30-40) 3775 [90-100) 2793 [20-30) 1657 [10-20) 691 [0-10) 161 Name: count, dtype: int64
df['age'].isnull().sum()
0
df['age'].dtypes
dtype('O')
It has data type showing 'Object' but it should be 'integer'
Also ages are in 10 year interval
# 6.weight
df['weight'].value_counts()
weight ? 98569 [75-100) 1336 [50-75) 897 [100-125) 625 [125-150) 145 [25-50) 97 [0-25) 48 [150-175) 35 [175-200) 11 >200 3 Name: count, dtype: int64
df['weight'].dtypes
dtype('O')
- This column contance 98569 '?' value which is 96.85%
- It contace 25kg weight interval staring from 0 to 200.
- Data type 'Object'
# 7.admission_type_id
df['admission_type_id'].value_counts()
admission_type_id 1 53990 3 18869 2 18480 6 5291 5 4785 8 320 7 21 4 10 Name: count, dtype: int64
df['admission_type_id'].isnull().sum()
0
df['admission_type_id'].dtypes
dtype('int64')
- This columns dont have any problem in the point of datacleaning
# 8.discharge_disposition_id
df['discharge_disposition_id'].value_counts()
discharge_disposition_id 1 60234 3 13954 6 12902 18 3691 2 2128 22 1993 11 1642 5 1184 25 989 4 815 7 623 23 412 13 399 14 372 28 139 8 108 15 63 24 48 9 21 17 14 16 11 19 8 10 6 27 5 12 3 20 2 Name: count, dtype: int64
df['discharge_disposition_id'].dtypes
dtype('int64')
df['discharge_disposition_id'].isnull().sum()
0
- This columns dont have any problem in the point of datacleaning
# 9.admission_source_id
df['admission_source_id'].value_counts()
admission_source_id 7 57494 1 29565 17 6781 4 3187 6 2264 2 1104 5 855 3 187 20 161 9 125 8 16 22 12 10 8 14 2 11 2 25 2 13 1 Name: count, dtype: int64
df['admission_source_id'].dtypes
dtype('int64')
df['admission_source_id'].isnull().sum()
0
- This columns dont have any problem in the point of datacleaning
# 10.time_in_hospital
df['time_in_hospital'].value_counts()
time_in_hospital 3 17756 2 17224 1 14208 4 13924 5 9966 6 7539 7 5859 8 4391 9 3002 10 2342 11 1855 12 1448 13 1210 14 1042 Name: count, dtype: int64
df['time_in_hospital'].isnull().sum()
0
df['time_in_hospital'].dtypes
dtype('int64')
- This columns dont have any problem in the point of datacleaning
# 11. payer_code
df['payer_code'].isnull().sum()
0
df['payer_code'].dtypes
dtype('O')
df['payer_code'].value_counts()
payer_code ? 40256 MC 32439 HM 6274 SP 5007 BC 4655 MD 3532 CP 2533 UN 2448 CM 1937 OG 1033 PO 592 DM 549 CH 146 WC 135 OT 95 MP 79 SI 55 FR 1 Name: count, dtype: int64
- It contance 40256 (39.55%) value as '?'
# 12. medical_specialty
df['medical_specialty'].value_counts()
medical_specialty
? 49949
InternalMedicine 14635
Emergency/Trauma 7565
Family/GeneralPractice 7440
Cardiology 5352
...
SportsMedicine 1
Speech 1
Perinatology 1
Neurophysiology 1
Pediatrics-InfectiousDiseases 1
Name: count, Length: 73, dtype: int64
df['medical_specialty'].isnull().sum()
0
df['medical_specialty'].dtypes
dtype('O')
- It contance 49949 (49.08%) value as '?'
# 13. num_lab_procedures
df['num_lab_procedures'].value_counts()
num_lab_procedures
1 3208
43 2804
44 2496
45 2376
38 2213
...
120 1
132 1
121 1
126 1
118 1
Name: count, Length: 118, dtype: int64
- Not any problem in terms of Data Cleaning
# 14. num_procedures
df['num_procedures'].value_counts()
num_procedures 0 46652 1 20742 2 12717 3 9443 6 4954 4 4180 5 3078 Name: count, dtype: int64
- Not any problem in terms of Data Cleaning
# 15. num_medications
df['num_medications'].value_counts()
num_medications
13 6086
12 6004
11 5795
15 5792
14 5707
...
70 2
75 2
81 1
79 1
74 1
Name: count, Length: 75, dtype: int64
- Not any problem in terms of Data Cleaning
# 16. number_outpatient
df['number_outpatient'].isnull().sum()
0
df['number_outpatient'].dtypes
dtype('int64')
df['number_outpatient'].value_counts()
number_outpatient 0 85027 1 8547 2 3594 3 2042 4 1099 5 533 6 303 7 155 8 98 9 83 10 57 11 42 13 31 12 30 14 28 15 20 16 15 17 8 21 7 20 7 18 5 22 5 19 3 27 3 24 3 26 2 23 2 25 2 33 2 35 2 36 2 29 2 34 1 39 1 42 1 28 1 37 1 38 1 40 1 Name: count, dtype: int64
- 85027 values filled with 0
# 17. number_emergency
df['number_emergency'].value_counts()
number_emergency 0 90383 1 7677 2 2042 3 725 4 374 5 192 6 94 7 73 8 50 10 34 9 33 11 23 13 12 12 10 22 6 16 5 18 5 19 4 20 4 15 3 14 3 25 2 21 2 28 1 42 1 46 1 76 1 37 1 64 1 63 1 54 1 24 1 29 1 Name: count, dtype: int64
df['number_emergency'].isnull().sum()
0
df['number_emergency'].dtypes
dtype('int64')
- 90383 values filled with 0
#18. number_inpatient
df['number_inpatient'].value_counts()
number_inpatient 0 67630 1 19521 2 7566 3 3411 4 1622 5 812 6 480 7 268 8 151 9 111 10 61 11 49 12 34 13 20 14 10 15 9 16 6 19 2 17 1 21 1 18 1 Name: count, dtype: int64
df['number_inpatient'].dtypes
dtype('int64')
df['number_inpatient'].isnull().sum()
0
#19. diag_1
df['diag_1'].value_counts()
diag_1
428 6862
414 6581
786 4016
410 3614
486 3508
...
373 1
314 1
684 1
217 1
V51 1
Name: count, Length: 717, dtype: int64
# Count the values in 'diag_1' that start with any letter of the English alphabet
count_alpha_start = df['diag_1'].str.match('^[A-Za-z]').sum()
print(f"Number of values starting with a letter: {count_alpha_start}")
Number of values starting with a letter: 1645
count_values_starting_with_V = df['diag_1'].str.startswith('V').sum()
print("Number of values starting with 'V':", count_values_starting_with_V)
Number of values starting with 'V': 1644
count_values_starting_with_E = df['diag_1'].str.startswith('E').sum()
print("Number of values starting with 'E':", count_values_starting_with_E)
Number of values starting with 'E': 1
df['diag_1'].isnull().sum()
0
df['diag_1'].dtypes
dtype('O')
- Data types is object
- 21 value filled with '?'
- There are 1644 values start with 'V'
- There are 1 values start with 'E'
- Some decimal values is there
#20. diag_2
df['diag_2'].value_counts()
diag_2
276 6752
428 6662
250 6071
427 5036
401 3736
...
E918 1
46 1
V13 1
E850 1
927 1
Name: count, Length: 749, dtype: int64
# Count the values in 'diag_2' that start with any letter of the English alphabet
count_alpha_start = df['diag_2'].str.match('^[A-Za-z]').sum()
print(f"Number of values starting with a letter: {count_alpha_start}")
Number of values starting with a letter: 2536
count_values_starting_with_V = df['diag_2'].str.startswith('V').sum()
print("Number of values starting with 'V':", count_values_starting_with_V)
Number of values starting with 'V': 1805
count_values_starting_with_E = df['diag_2'].str.startswith('E').sum()
print("Number of values starting with 'E':", count_values_starting_with_E)
Number of values starting with 'E': 731
df['diag_2'].isnull().sum()
0
df['diag_2'].dtypes
dtype('O')
- Data types is object
- 358 value filled with '?'
- There are 1805 values start with 'V'
- There are 731 values start with 'E'
- Some decimal values is there
# 21. Diag_3
df['diag_3'].value_counts()
diag_3
250 11555
401 8289
276 5175
428 4577
427 3955
...
657 1
684 1
603 1
E826 1
971 1
Name: count, Length: 790, dtype: int64
df['diag_3'].dtypes
dtype('O')
df['diag_3'].isnull().sum()
0
# Count the values in 'diag_3' that start with any letter of the English alphabet
count_alpha_start = df['diag_3'].str.match('^[A-Za-z]').sum()
print(f"Number of values starting with a letter: {count_alpha_start}")
Number of values starting with a letter: 5058
count_values_starting_with_V = df['diag_3'].str.startswith('V').sum()
print("Number of values starting with 'V':", count_values_starting_with_V)
Number of values starting with 'V': 3814
count_values_starting_with_E = df['diag_3'].str.startswith('E').sum()
print("Number of values starting with 'E':", count_values_starting_with_E)
Number of values starting with 'E': 1244
- Data types is object
- 1423 value filled with '?'
- There are 3814 values start with 'V'
- There are 1244 values start with 'E'
- Some decimal values is there
# 22.number_diagnoses
df['number_diagnoses'].value_counts()
number_diagnoses 9 49474 5 11393 8 10616 7 10393 6 10161 4 5537 3 2835 2 1023 1 219 16 45 10 17 13 16 11 11 15 10 12 9 14 7 Name: count, dtype: int64
df['number_diagnoses'].dtypes
dtype('int64')
df['number_diagnoses'].isnull().sum()
0
- There is no change required in terms of data cleaning
# 23. max_glu_serum
df['max_glu_serum'].value_counts()
max_glu_serum Norm 2597 >200 1485 >300 1264 Name: count, dtype: int64
df['max_glu_serum'].isnull().sum()
96420
df['max_glu_serum'].dtypes
dtype('O')
- There is no change required in terms of data cleaning
# 24. A1Cresult
df['A1Cresult'].value_counts()
A1Cresult >8 8216 Norm 4990 >7 3812 Name: count, dtype: int64
df['A1Cresult'].isnull().sum()
84748
df['A1Cresult'].dtypes
dtype('O')
- There is no change required in terms of data cleaning
# 25. metformin
df['metformin'].value_counts()
metformin No 81778 Steady 18346 Up 1067 Down 575 Name: count, dtype: int64
df['metformin'].dtypes
dtype('O')
# 26.repaglinide
df['repaglinide'].value_counts()
repaglinide No 100227 Steady 1384 Up 110 Down 45 Name: count, dtype: int64
- No Change required in terms of data cleaning
# 27.nateglinide
df['nateglinide'].value_counts()
nateglinide No 101063 Steady 668 Up 24 Down 11 Name: count, dtype: int64
- No Change required in terms of data cleaning
# 28. chlorpropamide
df['chlorpropamide'].value_counts()
chlorpropamide No 101680 Steady 79 Up 6 Down 1 Name: count, dtype: int64
- No Change required in terms of data cleaning
# 29. glimepiride
df['glimepiride'].value_counts()
glimepiride No 96575 Steady 4670 Up 327 Down 194 Name: count, dtype: int64
- No Change required in terms of data cleaning
# 30. acetohexamide
df['acetohexamide'].value_counts()
acetohexamide No 101765 Steady 1 Name: count, dtype: int64
- In this case acetohexamide is given to only one patient.
# 31. glipizide
df['glipizide'].value_counts()
glipizide No 89080 Steady 11356 Up 770 Down 560 Name: count, dtype: int64
- no change tequired in terms of data cleaning
# 32. glyburide
df['glyburide'].value_counts()
glyburide No 91116 Steady 9274 Up 812 Down 564 Name: count, dtype: int64
- No change required in terms of data cleaning
# 33. tolbutamide
df['tolbutamide'].value_counts()
tolbutamide No 101743 Steady 23 Name: count, dtype: int64
- Only 23 patient this medicine was given
# 34. pioglitazone
df['pioglitazone'].value_counts()
pioglitazone No 94438 Steady 6976 Up 234 Down 118 Name: count, dtype: int64
- No change required in terms of data clening
# 35. rosiglitazone
df['rosiglitazone'].value_counts()
rosiglitazone No 95401 Steady 6100 Up 178 Down 87 Name: count, dtype: int64
- No change required in terms of data cleaning
# 36. acarbose
df['acarbose'].value_counts()
acarbose No 101458 Steady 295 Up 10 Down 3 Name: count, dtype: int64
- only 308 patients was taken this medicine
# 37. miglitol
df['miglitol'].value_counts()
miglitol No 101728 Steady 31 Down 5 Up 2 Name: count, dtype: int64
- only 38 patients was taken this medicine
# 38. troglitazone
df['troglitazone'].value_counts()
troglitazone No 101763 Steady 3 Name: count, dtype: int64
- only 3 patients was taken this medicine
# 39.tolazamide
df['tolazamide'].value_counts()
tolazamide No 101727 Steady 38 Up 1 Name: count, dtype: int64
- only 39 patients was taken this medicine
# 40. examide
df['examide'].value_counts()
examide No 101766 Name: count, dtype: int64
- no one taken this medicine
#41. citoglipton
df['citoglipton'].value_counts()
citoglipton No 101766 Name: count, dtype: int64
- no one taken this medicine
# 42. insulin
df['insulin'].value_counts()
insulin No 47383 Steady 30849 Down 12218 Up 11316 Name: count, dtype: int64
- Nothing to change for this column as per data clening
# 43. glyburide-metformin
df['glyburide-metformin'].value_counts()
glyburide-metformin No 101060 Steady 692 Up 8 Down 6 Name: count, dtype: int64
- only 706 patient was taken this medicine
# 44. glipizide-metformin
df['glipizide-metformin'].value_counts()
glipizide-metformin No 101753 Steady 13 Name: count, dtype: int64
- only 13 patient was taken this medicine
# 45. glimepiride-pioglitazone
df['glimepiride-pioglitazone'].value_counts()
glimepiride-pioglitazone No 101765 Steady 1 Name: count, dtype: int64
- only one patient was taken this medicine
# 46. metformin-rosiglitazone
df['metformin-rosiglitazone'].value_counts()
metformin-rosiglitazone No 101764 Steady 2 Name: count, dtype: int64
- only 2 patient was taken this medicine
# 47. metformin-pioglitazone
df['metformin-pioglitazone'].value_counts()
metformin-pioglitazone No 101765 Steady 1 Name: count, dtype: int64
- only one patient was taken this medicine
# 48. change
df['change'].value_counts()
change No 54755 Ch 47011 Name: count, dtype: int64
- Nothing to change for this column in terms of data clening
# 49.diabetesMed
df['diabetesMed'].value_counts()
diabetesMed Yes 78363 No 23403 Name: count, dtype: int64
- no need to change anything for this columns
# 50. readmitted
df['readmitted'].value_counts()
readmitted NO 54864 >30 35545 <30 11357 Name: count, dtype: int64
- No need to change anything for this columns in the point of data cleaning
Issues with the Dataset¶
'encounter_id' and 'patient_nbr' both are patient identifier
In 'race' column there are 2273 which is 2.23% value filled with '?'
In 'gender' column there are 3 value filled with ''Unknown/Invalid'
In 'age' column the dataset only gives us age as 10 year interval(from 0 to 100). so we don't know the exact age of each patient and data type also object.
- This column contance 98569 '?' value which is 96.85%
- It contace 25kg weight interval staring from 0 to 200.
- Data type 'Object'
In 'payer_code' column .It contance 40256 (39.55%) value as '?'
In 'medical_specialty' column 49949 (49.08%) value as '?'
In 'number_outpatient' column 85027 values filled with 0
In 'number_emergency' column 90383 values filled with 0
In 'diag_1' column
- Data types is object
- 21 value filled with '?'
- There are 1644 values start with 'V'
- There are 1 values start with 'E'
- Some decimal values is there
- In 'diag_2' column
- Data types is object
- 358 value filled with '?'
- There are 1805 values start with 'V'
- There are 731 values start with 'E'
- Some decimal values is there
- In 'diag_3' column
- Data types is object
- 1423 value filled with '?'
- There are 3814 values start with 'V'
- There are 1244 values start with 'E'
- Some decimal values is there
'acetohexamide' is give to only one patient
'tolbutamide' Only 23 patient this medicine was given
'acarbose' only 308 patient wa taken this medicine
'miglitol' only 38 patient was given to this medicine
'troglitazone' only 3 patients was taken this medicine
'tolazamide' only 39 patient was taken to this medicine
'examide' no one taken this medicine
'citoglipton' no one taken this medicine
'glyburide-metformin' only 706 patient was taken this medicine
'glipizide-metformin' only 13 patient was taken this medicine
'glimepiride-pioglitazone' only one patient was taken this medicine
'metformin-rosiglitazone' only 2 patient was taken this medicine
'metformin-pioglitazone' only one patient was taken this medicine
(C) - Data Cleaning¶
# Make a copy of the dataset
patients_df = df.copy
- In this dataset missing values are represented as "?" sign
- weight has 96.85%,payer_code has 39.55% and medical_specialty has 49.08% missing values that means we can drop this columns
# dropping columns with large number of missing values
df = df.drop(['weight','payer_code','medical_specialty'], axis=1)
df.shape
(101766, 47)
There are total 50 features looking into all of these:¶¶
#1.encounter_id
df['encounter_id'].nunique()
101766
- Patients that have come to the hospital have been provided with different unique encounter_id .so that's why there are 101766 values.
#2. patient_nbr
df['patient_nbr'].nunique()
71518
- There are 71518 unique paitents that have entered the system
- encounter_id and patient_nbr these are just identifiers and not useful variables so they can be dropped from our future analysis
#droping not useful variables
df.drop(['encounter_id','patient_nbr'],axis=1,inplace=True)
# 3 race
df['race'].value_counts()
race Caucasian 76099 AfricanAmerican 19210 ? 2273 Hispanic 2037 Other 1506 Asian 641 Name: count, dtype: int64
- There are unknown values in the form of '?' which we can replaced with nan values.so that later these values can be filled
df['race'] = df['race'].replace('?',np.nan)
df['race'].unique()
array(['Caucasian', 'AfricanAmerican', nan, 'Other', 'Asian', 'Hispanic'],
dtype=object)
df['race']=df['race'].fillna(df['race'].mode()[0]) #replacing with mode
df['race'].value_counts()
race Caucasian 78372 AfricanAmerican 19210 Hispanic 2037 Other 1506 Asian 641 Name: count, dtype: int64
#4.Gender
df['gender'].value_counts()
gender Female 54708 Male 47055 Unknown/Invalid 3 Name: count, dtype: int64
- There is a 3rd category of unknown/invalid which are missing/unknown values as there count is less so they can be dropped
df['gender']=df['gender'].replace('Unknown/Invalid',np.nan)
df['gender'].dropna(inplace=True)
df['gender'].value_counts()
gender Female 54708 Male 47055 Name: count, dtype: int64
df['gender'].unique()
array(['Female', 'Male', nan], dtype=object)
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
df['gender'].value_counts()
gender Female 54711 Male 47055 Name: count, dtype: int64
df.replace('?',np.nan,inplace=True)
df.isnull().sum()
race 0 gender 0 age 0 admission_type_id 0 discharge_disposition_id 0 admission_source_id 0 time_in_hospital 0 num_lab_procedures 0 num_procedures 0 num_medications 0 number_outpatient 0 number_emergency 0 number_inpatient 0 diag_1 21 diag_2 358 diag_3 1423 number_diagnoses 0 max_glu_serum 96420 A1Cresult 84748 metformin 0 repaglinide 0 nateglinide 0 chlorpropamide 0 glimepiride 0 acetohexamide 0 glipizide 0 glyburide 0 tolbutamide 0 pioglitazone 0 rosiglitazone 0 acarbose 0 miglitol 0 troglitazone 0 tolazamide 0 examide 0 citoglipton 0 insulin 0 glyburide-metformin 0 glipizide-metformin 0 glimepiride-pioglitazone 0 metformin-rosiglitazone 0 metformin-pioglitazone 0 change 0 diabetesMed 0 readmitted 0 dtype: int64
#5.age
df['age'].value_counts()
age [70-80) 26068 [60-70) 22483 [50-60) 17256 [80-90) 17197 [40-50) 9685 [30-40) 3775 [90-100) 2793 [20-30) 1657 [10-20) 691 [0-10) 161 Name: count, dtype: int64
- It shows the age range between 0 to 100
- Most of the population come under 50 to 80
- Dealing with age.The dataset only gives us age as 10 year interval. so we don't know the exact age of each patient
- We assume age of the patient on average lies at the midpoint of the age category.For eg: If the patients age category is 40-50 years, Then we assume the age = 45 years.
- So we converted age categories to midpoints
df['age'].unique()
array(['[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)',
'[60-70)', '[70-80)', '[80-90)', '[90-100)'], dtype=object)
df['age']=df['age'].replace({'[0-10)':5,'[10-20)':15,'[20-30)':25,'[30-40)':35,'[40-50)':45,'[50-60)':55,
'[60-70)':65,'[70-80)':75,'[80-90)':85,'[90-100)':95})
df.age.value_counts()
age 75 26068 65 22483 55 17256 85 17197 45 9685 35 3775 95 2793 25 1657 15 691 5 161 Name: count, dtype: int64
#6.admission_type_id
df['admission_type_id'].value_counts()
admission_type_id 1 53990 3 18869 2 18480 6 5291 5 4785 8 320 7 21 4 10 Name: count, dtype: int64
#7.max_glu_serum
df.max_glu_serum.value_counts()
max_glu_serum Norm 2597 >200 1485 >300 1264 Name: count, dtype: int64
df.max_glu_serum.unique()
array([nan, '>300', 'Norm', '>200'], dtype=object)
df['max_glu_serum'].fillna('Test Not Done', inplace=True)
df.max_glu_serum.value_counts()
max_glu_serum Test Not Done 96420 Norm 2597 >200 1485 >300 1264 Name: count, dtype: int64
# A1Cresult
df.A1Cresult.value_counts()
A1Cresult >8 8216 Norm 4990 >7 3812 Name: count, dtype: int64
df.A1Cresult.unique()
array([nan, '>7', '>8', 'Norm'], dtype=object)
df['A1Cresult'].fillna('Test Not Done', inplace=True)
df.A1Cresult.value_counts()
A1Cresult Test Not Done 84748 >8 8216 Norm 4990 >7 3812 Name: count, dtype: int64
df.shape
(101766, 45)
EDA¶
Handling outliers¶
# Create a boxplot for each numerical column
plt.figure(figsize=(10, 8))
sns.boxplot(data=df)
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.show()
Preserving outliers in these columns is vital as they exhibit a strong correlation with the target variable in healthcare analytics. Outliers in healthcare data often represent extreme cases or rare conditions, which hold significant clinical relevance and predictive power. Removing them risks distorting the true nature of patient outcomes, treatment effectiveness, or disease progression, compromising the accuracy and interpretability of predictive models. By retaining these outliers, we ensure the model captures the full spectrum of patient experiences, enabling more nuanced insights, precise risk assessment, and tailored interventions, ultimately - enhancing the quality of care delivery and patient outcomes in clinical practice.
Feature engineering¶
Importing ID's mapping CSV¶
ID_map=pd.read_csv('IDs_mapping.csv')
ID_map.loc[:,['admission_type_id','description']].head(8)
| admission_type_id | description | |
|---|---|---|
| 0 | 1.0 | Emergency |
| 1 | 2.0 | Urgent |
| 2 | 3.0 | Elective |
| 3 | 4.0 | Newborn |
| 4 | 5.0 | Not Available |
| 5 | 6.0 | NaN |
| 6 | 7.0 | Trauma Center |
| 7 | 8.0 | Not Mapped |
According to ID's mapping¶
admission_type_id description:
- These 8 values correspond to:
1.Emergency 2.Urgent 3.Elective 4.Newborn 5.Not Avallable 6.NULL 7.Trauma Center 8.Not Mapped
From this description we can see that the majority patients that heve been admitted in this hospital have an admission type as emergency.
For future analysis we can merged these categories into less number of categories.
Emergency/Urgent/Trauma Center-To-Emergency
Not Available/Null/Not Mapped-TO-Not Avaliable
Elective & New Born remain same
df['admission_type_id']=df['admission_type_id'].replace({1:'Emergency',2:'Emergency',7:'Emergency',5:'Not Available',6:'Not Available',8:'Not Available',3:'Elective',4:'Newborn'})
df['admission_type_id'].value_counts()
admission_type_id Emergency 72491 Elective 18869 Not Available 10396 Newborn 10 Name: count, dtype: int64
- From admission_type_id dropping Newborn category as its not giving correct information if we compare it with the age
df[df.loc[:,'admission_type_id'] == 'Newborn']
| race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2043 | Caucasian | Male | 55 | Newborn | 1 | 4 | 2 | 56 | 2 | 13 | ... | No | Steady | No | No | No | No | No | No | Yes | NO |
| 2203 | Caucasian | Male | 85 | Newborn | 1 | 4 | 2 | 46 | 6 | 14 | ... | No | Steady | No | No | No | No | No | No | Yes | >30 |
| 2461 | AfricanAmerican | Male | 75 | Newborn | 6 | 4 | 5 | 73 | 1 | 12 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4823 | Caucasian | Female | 65 | Newborn | 6 | 4 | 2 | 39 | 0 | 7 | ... | No | Steady | No | No | No | No | No | No | Yes | <30 |
| 35877 | AfricanAmerican | Male | 65 | Newborn | 1 | 14 | 3 | 66 | 0 | 18 | ... | No | Steady | No | No | No | No | No | No | Yes | NO |
| 47548 | Hispanic | Male | 45 | Newborn | 1 | 7 | 6 | 39 | 4 | 7 | ... | No | No | No | No | No | No | No | No | No | NO |
| 48711 | Caucasian | Female | 75 | Newborn | 6 | 1 | 4 | 37 | 1 | 18 | ... | No | No | No | No | No | No | No | Ch | Yes | >30 |
| 80354 | Hispanic | Female | 5 | Newborn | 1 | 1 | 2 | 27 | 0 | 2 | ... | No | No | No | No | No | No | No | No | No | NO |
| 87714 | Caucasian | Male | 65 | Newborn | 1 | 4 | 3 | 28 | 0 | 15 | ... | No | Steady | Steady | No | No | No | No | Ch | Yes | NO |
| 100721 | Asian | Male | 85 | Newborn | 6 | 7 | 3 | 51 | 0 | 10 | ... | No | No | No | No | No | No | No | No | Yes | NO |
10 rows × 45 columns
df.drop(df.index[df['admission_type_id'] == 'Newborn'],inplace = True)
df['admission_type_id'].value_counts()
admission_type_id Emergency 72491 Elective 18869 Not Available 10396 Name: count, dtype: int64
#7.discharge_disposition_id:
df['discharge_disposition_id'].value_counts()
discharge_disposition_id 1 60228 3 13954 6 12898 18 3691 2 2128 22 1993 11 1642 5 1184 25 989 4 815 7 623 23 412 13 399 14 372 28 139 8 108 15 63 24 48 9 21 17 14 16 11 19 8 10 6 27 5 12 3 20 2 Name: count, dtype: int64
df['discharge_disposition_id'].unique()
array([25, 1, 3, 6, 2, 5, 11, 7, 10, 4, 14, 18, 8, 13, 12, 16, 17,
22, 23, 9, 20, 15, 24, 28, 19, 27], dtype=int64)
According to ID's mapping:¶
- discharge_disposition_id description:
ID_map.loc[:,['discharge_disposition_id','description1']].head(30)
| discharge_disposition_id | description1 | |
|---|---|---|
| 0 | 1 | Discharged to home |
| 1 | 2 | Discharged/transferred to another short term h... |
| 2 | 3 | Discharged/transferred to SNF |
| 3 | 4 | Discharged/transferred to ICF |
| 4 | 5 | Discharged/transferred to another type of inpa... |
| 5 | 6 | Discharged/transferred to home with home healt... |
| 6 | 7 | Left AMA |
| 7 | 8 | Discharged/transferred to home under care of H... |
| 8 | 9 | Admitted as an inpatient to this hospital |
| 9 | 10 | Neonate discharged to another hospital for neo... |
| 10 | 11 | Expired |
| 11 | 12 | Still patient or expected to return for outpat... |
| 12 | 13 | Hospice / home |
| 13 | 14 | Hospice / medical facility |
| 14 | 15 | Discharged/transferred within this institution... |
| 15 | 16 | Discharged/transferred/referred another instit... |
| 16 | 17 | Discharged/transferred/referred to this instit... |
| 17 | 18 | NaN |
| 18 | 19 | Expired at home. Medicaid only, hospice. |
| 19 | 20 | Expired in a medical facility. Medicaid only, ... |
| 20 | 21 | Expired, place unknown. Medicaid only, hospice. |
| 21 | 22 | Discharged/transferred to another rehab fac in... |
| 22 | 23 | Discharged/transferred to a long term care hos... |
| 23 | 24 | Discharged/transferred to a nursing facility c... |
| 24 | 25 | Not Mapped |
| 25 | 26 | Unknown/Invalid |
| 26 | 30 | Discharged/transferred to another Type of Heal... |
| 27 | 27 | Discharged/transferred to a federal health car... |
| 28 | 28 | Discharged/transferred/referred to a psychiatr... |
| 29 | 29 | Discharged/transferred to a Critical Access Ho... |
We are merging several categories of discharge_diposition_id into fewer categories:¶
- discharge_disposition_id: 1,6,8-TO-Discharge to home
- discharge_disposition_id: 2,3,4,5,16,22,23,24,27,28,29,30-TO-Transferred to another facility
- discharge_disposition_id: 7-TO-Left AMA (Against Medical Advice)
- discharge_disposition_id: 9,12,15,17-TO-Still patient/referred to this institution
- discharge_disposition_id: 10 -TO-Neonate discharged
- discharge_disposition_id: 11,19,20,21 -TO-Expired
- discharge_disposition_id: 13,14 -TO-Hospice
- discharge_disposition_id: 18,25,26 -TO-Not Available
df['discharge_disposition_id']=df['discharge_disposition_id'].replace({1:'Discharged to home',
6:'Discharged to home',8:'Discharged to home',2:'Transferred to another facility',
3:'Transferred to another facility',4:'Transferred to another facility',5:'Transferred to another facility',
16:'Transferred to another facility',22:'Transferred to another facility',23:'Transferred to another facility',
24:'Transferred to another facility',27:'Transferred to another facility',28:'Transferred to another facility',
29:'Transferred to another facility',30:'Transferred to another facility',7:'Left AMA',
9:'Still patient/referred to this institution',12:'Still patient/referred to this institution',
15:'Still patient/referred to this institution',17:'Still patient/referred to this institution',
10:'Neonate discharged',11:'Expired',19:'Expired',20:'Expired',21:'Expired',13:'Hospice',14:'Hospice',
18:'Not Available',25:'Not Available',26:'Not Available'})
df['discharge_disposition_id'].value_counts()
discharge_disposition_id Discharged to home 73234 Transferred to another facility 20689 Not Available 4680 Expired 1652 Hospice 771 Left AMA 623 Still patient/referred to this institution 101 Neonate discharged 6 Name: count, dtype: int64
- The patients who have expired,Hospice and Neonate(newborn child)should be removed from the column because they will create bias as they will not be readmitted
df = df[df['discharge_disposition_id']!='Expired']
df = df[df['discharge_disposition_id']!='Neonate discharged']
df = df[df['discharge_disposition_id']!='Hospice']
df['discharge_disposition_id'].value_counts()
discharge_disposition_id Discharged to home 73234 Transferred to another facility 20689 Not Available 4680 Left AMA 623 Still patient/referred to this institution 101 Name: count, dtype: int64
#8 admission_source_id:
df['admission_source_id'].value_counts()
admission_source_id 7 55845 1 29163 17 6570 4 3113 6 2239 2 1081 5 806 3 185 20 159 9 125 8 15 22 12 10 8 11 2 25 2 14 1 13 1 Name: count, dtype: int64
df['admission_source_id'].nunique()
17
According to ID's mapping:¶
admission_source_id description:
ID_map.loc[:,['admission_source_id','description2']].head(25)
| admission_source_id | description2 | |
|---|---|---|
| 0 | 1.0 | Physician Referral |
| 1 | 2.0 | Clinic Referral |
| 2 | 3.0 | HMO Referral |
| 3 | 4.0 | Transfer from a hospital |
| 4 | 5.0 | Transfer from a Skilled Nursing Facility (SNF) |
| 5 | 6.0 | Transfer from another health care facility |
| 6 | 7.0 | Emergency Room |
| 7 | 8.0 | Court/Law Enforcement |
| 8 | 9.0 | Not Available |
| 9 | 10.0 | Transfer from critial access hospital |
| 10 | 11.0 | Normal Delivery |
| 11 | 12.0 | Premature Delivery |
| 12 | 13.0 | Sick Baby |
| 13 | 14.0 | Extramural Birth |
| 14 | 15.0 | Not Available |
| 15 | 17.0 | NaN |
| 16 | 18.0 | Transfer From Another Home Health Agency |
| 17 | 19.0 | Readmission to Same Home Health Agency |
| 18 | 20.0 | Not Mapped |
| 19 | 21.0 | Unknown/Invalid |
| 20 | 22.0 | Transfer from hospital inpt/same fac reslt in... |
| 21 | 23.0 | Born inside this hospital |
| 22 | 24.0 | Born outside this hospital |
| 23 | 25.0 | Transfer from Ambulatory Surgery Center |
| 24 | 26.0 | Transfer from Hospice |
We are merging several categories of admission_source_id into fewer categories:¶
- admission_source_id: 1,2,3 -TO-Referral
- admission_source_id: 4,5,6,10,18,19,22,25,26 -TO-Transferred from hospital(another care facility)
- admission_source_id: 7,8 -TO-Emergency
- admission_source_id: 9,15,17,20,21 -TO-Not Available
- admission_source_id: 11,12,14,13,23,24 -TO-Delivery
df['admission_source_id']=df['admission_source_id'].replace({1:'Referral',2:'Referral',3:'Referral',4:'Transferred from hospital',5:'Transferred from hospital',6:'Transferred from hospital',10:'Transferred from hospital',18:'Transferred from hospital',19:'Transferred from hospital',25:'Transferred from hospital',26:'Transferred from hospital',22:'Transferred from hospital',7:'Emergency',8:'Emergency',9:'Not Available',15:'Not Available',17:'Not Available',20:'Not Available',21:'Not Available',11:'Delivery',12:'Delivery', 13:'Delivery',14:'Delivery',23:'Delivery',24:'Delivery'})
df['admission_source_id'].value_counts()
admission_source_id Emergency 55860 Referral 30429 Not Available 6854 Transferred from hospital 6180 Delivery 4 Name: count, dtype: int64
- Dropping the category delivery because the admission_source_id == delivery and the age corresponding to it give contradictory information
df=df[df['admission_source_id']!='Delivery']
df['admission_source_id'].value_counts()
admission_source_id Emergency 55860 Referral 30429 Not Available 6854 Transferred from hospital 6180 Name: count, dtype: int64
#9.time_in_hospital
df['time_in_hospital'].value_counts(sort=False)
time_in_hospital 1 13824 3 17425 2 16886 4 13682 5 9747 13 1152 12 1383 9 2878 7 5696 10 2262 6 7354 11 1770 8 4269 14 995 Name: count, dtype: int64
- The time in the hospital ranges from 1 day to 14 days
#10.num_lab_procedures
df['num_lab_procedures'].unique()
array([ 41, 59, 11, 44, 51, 31, 70, 73, 68, 33, 47, 62, 60,
55, 49, 75, 45, 29, 35, 42, 66, 36, 19, 64, 25, 53,
52, 87, 27, 37, 28, 48, 10, 2, 65, 67, 40, 54, 58,
57, 43, 32, 83, 34, 39, 69, 38, 72, 22, 96, 46, 78,
56, 61, 88, 50, 1, 18, 82, 9, 63, 24, 71, 77, 81,
76, 90, 93, 3, 103, 13, 80, 85, 16, 15, 12, 30, 23,
17, 21, 79, 26, 5, 95, 97, 84, 14, 74, 105, 86, 98,
20, 6, 94, 8, 102, 7, 89, 91, 92, 4, 101, 99, 100,
114, 113, 111, 129, 107, 108, 106, 104, 109, 120, 132, 121, 126,
118], dtype=int64)
print('The number of lab tests a patient has undergone can range from',df.num_lab_procedures.min(),'to',df.num_lab_procedures.max())
The number of lab tests a patient has undergone can range from 1 to 132
11#num_procedures:
df.num_procedures.value_counts()
num_procedures 0 45672 1 20247 2 12367 3 9202 6 4799 4 4048 5 2988 Name: count, dtype: int64
No of procedures range from 0 to 5
#12. num_medication:
print('The number of medication range from a minimum of',df.num_medications.min(),'to maximum of',df.num_medications.max())
The number of medication range from a minimum of 1 to maximum of 81
- The number of medications range from a minimum of 1 to maximum of 81
1.number_outpatient: Number of outpatient visits of the patient in the year preceding the encounter
2.number_emergency: Number of emergency visits of the patient in the year preceding the encounter
3.number_inpatient: Number of inpatient visits of the patient in the year preceding the encounter
df['number_outpatient'].value_counts()
number_outpatient 0 82976 1 8348 2 3509 3 1996 4 1077 5 514 6 294 7 154 8 96 9 82 10 57 11 41 13 30 12 29 14 28 15 20 16 13 17 8 21 6 20 6 18 5 22 5 19 3 27 3 24 3 26 2 23 2 33 2 35 2 36 2 29 2 39 1 34 1 42 1 25 1 28 1 37 1 38 1 40 1 Name: count, dtype: int64
df['number_emergency'].value_counts()
number_emergency 0 88229 1 7474 2 1984 3 706 4 369 5 190 6 93 7 72 8 50 10 34 9 33 11 23 13 12 12 10 22 6 16 5 18 5 19 4 20 4 15 3 14 3 25 2 21 2 28 1 42 1 46 1 76 1 37 1 64 1 63 1 54 1 24 1 29 1 Name: count, dtype: int64
df['number_inpatient'].value_counts()
number_inpatient 0 66230 1 18980 2 7299 3 3271 4 1574 5 791 6 474 7 262 8 145 9 109 10 59 11 49 12 32 13 18 14 10 15 9 16 6 19 2 17 1 21 1 18 1 Name: count, dtype: int64
Service utilization: The data contains variables for number of inpatient(admissions).Emergency room visits and outpatient visits for a given patient in the previous one year.These are (crude) measures of how much hospital/clinic services a person has used in the past year.we added these three to create a new variable called service utilization(sea figure below). The idea was to see which version gives us better results.Granted,we did not apply any special weighting to the three ingredients of service utilization but we wanted to try something simple at this stage. so we combined these features into a new variable called Service utilization
df['service_utilization'] = df['number_outpatient']+df['number_emergency']+df['number_inpatient'] #year_visits
df.drop(['number_outpatient','number_emergency','number_inpatient'],axis=1,inplace=True)
df.columns
Index(['race', 'gender', 'age', 'admission_type_id',
'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
'num_lab_procedures', 'num_procedures', 'num_medications', 'diag_1',
'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
'tolazamide', 'examide', 'citoglipton', 'insulin',
'glyburide-metformin', 'glipizide-metformin',
'glimepiride-pioglitazone', 'metformin-rosiglitazone',
'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted',
'service_utilization'],
dtype='object')
df.head()
| race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | ... | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | service_utilization | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caucasian | Female | 5 | Not Available | Not Available | Referral | 1 | 41 | 0 | 1 | ... | No | No | No | No | No | No | No | No | NO | 0 |
| 1 | Caucasian | Female | 15 | Emergency | Discharged to home | Emergency | 3 | 59 | 0 | 18 | ... | Up | No | No | No | No | No | Ch | Yes | >30 | 0 |
| 2 | AfricanAmerican | Female | 25 | Emergency | Discharged to home | Emergency | 2 | 11 | 5 | 13 | ... | No | No | No | No | No | No | No | Yes | NO | 3 |
| 3 | Caucasian | Male | 35 | Emergency | Discharged to home | Emergency | 2 | 44 | 1 | 16 | ... | Up | No | No | No | No | No | Ch | Yes | NO | 0 |
| 4 | Caucasian | Male | 45 | Emergency | Discharged to home | Emergency | 1 | 51 | 0 | 8 | ... | Steady | No | No | No | No | No | Ch | Yes | NO | 0 |
5 rows × 43 columns
15.Diagnosis 1 : The primary diagnosis (coded as first three digit of ICD9):848 distint values
16.Diagnosis 2 : Secondary diagnosis (coded as first three digit of ICD9):923 distint values
17.Diagnosis 3 : Additional secondary diagnosis (coded as first three digit of ICD9):954 distint values
df.isnull().sum()
race 0 gender 0 age 0 admission_type_id 0 discharge_disposition_id 0 admission_source_id 0 time_in_hospital 0 num_lab_procedures 0 num_procedures 0 num_medications 0 diag_1 20 diag_2 356 diag_3 1419 number_diagnoses 0 max_glu_serum 0 A1Cresult 0 metformin 0 repaglinide 0 nateglinide 0 chlorpropamide 0 glimepiride 0 acetohexamide 0 glipizide 0 glyburide 0 tolbutamide 0 pioglitazone 0 rosiglitazone 0 acarbose 0 miglitol 0 troglitazone 0 tolazamide 0 examide 0 citoglipton 0 insulin 0 glyburide-metformin 0 glipizide-metformin 0 glimepiride-pioglitazone 0 metformin-rosiglitazone 0 metformin-pioglitazone 0 change 0 diabetesMed 0 readmitted 0 service_utilization 0 dtype: int64
Treating missing values in diag_1,diag_2,diag_3 by dropping all the missing values
#diag_1,diag_2,diag_3
df=df.dropna(subset=['diag_1','diag_2','diag_3'])
df.isnull().sum()
race 0 gender 0 age 0 admission_type_id 0 discharge_disposition_id 0 admission_source_id 0 time_in_hospital 0 num_lab_procedures 0 num_procedures 0 num_medications 0 diag_1 0 diag_2 0 diag_3 0 number_diagnoses 0 max_glu_serum 0 A1Cresult 0 metformin 0 repaglinide 0 nateglinide 0 chlorpropamide 0 glimepiride 0 acetohexamide 0 glipizide 0 glyburide 0 tolbutamide 0 pioglitazone 0 rosiglitazone 0 acarbose 0 miglitol 0 troglitazone 0 tolazamide 0 examide 0 citoglipton 0 insulin 0 glyburide-metformin 0 glipizide-metformin 0 glimepiride-pioglitazone 0 metformin-rosiglitazone 0 metformin-pioglitazone 0 change 0 diabetesMed 0 readmitted 0 service_utilization 0 dtype: int64
#15.number_diagnoses:
df['number_diagnoses'].value_counts()
number_diagnoses 9 47797 5 11165 8 10305 7 10194 6 9967 4 5464 3 2804 16 42 13 16 10 16 11 11 15 10 12 8 14 6 Name: count, dtype: int64
#16.max_glu_serum
df.max_glu_serum.value_counts()
max_glu_serum Test Not Done 92741 Norm 2504 >200 1398 >300 1162 Name: count, dtype: int64
df.max_glu_serum.value_counts()
max_glu_serum Test Not Done 92741 Norm 2504 >200 1398 >300 1162 Name: count, dtype: int64
- It indicates the range of the result was not taken values,">200",">300","normal" and "Test Not Done" if not measured
- We also reducing Glucose serum test into categories of Normal(1),Abnormal(2) and Not tested(0)
df["max_glu_serum"] = df["max_glu_serum"].replace({">200":2,">300":2,"Norm":1,"Test Not Done":0})
df.max_glu_serum.value_counts()
max_glu_serum 0 92741 2 2560 1 2504 Name: count, dtype: int64
Encoding¶
The outcome we are looking at is whether the patient gets readmitted to the hospital within 30 days or not.
df['readmitted'].value_counts()
readmitted NO 51475 >30 35124 <30 11206 Name: count, dtype: int64
#Readmission converting multiclass to binary class
df['readmitted'] = df['readmitted'].replace({">30":0,'NO':0,"<30":1})
df['readmitted'].value_counts()
readmitted 0 86599 1 11206 Name: count, dtype: int64
# 21.A1Cresult
- A1c test result indicates the range of the result or if the test was not taken. values: ">8" if the result was greater than 8%,">7" if the result was greater than 7% but less than 8%,'normal' if the result was less than 7%, and 'none' if not measured
- When sugar enters your bloodstream, it attaches to hemoglobin, a protein in your red blood cells.Everybody has some sugar attached to their hemoglobin, but people with higher blood sugar levels have more. The A1C test measures the percentage of your red blood cells that have sugar-coated hemoglobin.
df['A1Cresult'].unique()
array(['Test Not Done', '>7', '>8', 'Norm'], dtype=object)
df['A1Cresult'].value_counts()
A1Cresult Test Not Done 81393 >8 7797 Norm 4867 >7 3748 Name: count, dtype: int64
A normal A1C levels is below 5.7%,a level of 5.7% to 6.4% indicates prediabetes,and a level of 6.5% or more indicates diabetes.Within the 5.7% to 6.4% prediabetes range,The higher your A1C.the greater your risk is for developing type 2 diabetes.
df['A1Cresult'] = df['A1Cresult'].replace({"Test Not Done":0,"Norm":1,">7":2,">8":2})
df['A1Cresult'].unique()
array([0, 2, 1], dtype=int64)
#Gender
df['gender'] = df['gender'].replace({'Female':0,'Male':1})
1.Change: Change of medication indicates if there was a change in diabetic medication(either dosage or generic).Values:'change' and 'no change'
df['change'].value_counts()
change No 52346 Ch 45459 Name: count, dtype: int64
we encoded the 'medication change' feature from 'No'(no change) and 'Ch'(changed) into 0 and 1.
df['change'] = df['change'].replace({'No':0,'Ch':1})
1.diabetesMed: Indicates if there was any diabetic medication prescribed; Values: 'yes' and 'no'
df['diabetesMed'].value_counts()
diabetesMed Yes 75497 No 22308 Name: count, dtype: int64
# encoding diabetes med
df['diabetesMed'] = df['diabetesMed'].replace('Yes',1)
df['diabetesMed'] = df['diabetesMed'].replace('No',0)
df['diabetesMed'].value_counts()
diabetesMed 1 75497 0 22308 Name: count, dtype: int64
df.columns
Index(['race', 'gender', 'age', 'admission_type_id',
'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
'num_lab_procedures', 'num_procedures', 'num_medications', 'diag_1',
'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
'tolazamide', 'examide', 'citoglipton', 'insulin',
'glyburide-metformin', 'glipizide-metformin',
'glimepiride-pioglitazone', 'metformin-rosiglitazone',
'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted',
'service_utilization'],
dtype='object')
- 23 features for medications For the generic names: metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin, glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone and metformin-pioglitazone.the features indicates whether the drug was prescribed or there was Values 'up' if the dosage was increased during the encounter,'down' if the dosage was decreased,'steady' if the dosage did not change, and 'no' if the drug was not prescribed
drug = list(df.iloc[:,16:39].columns)
drug
['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone']
for i in drug:
print(i,df[i].unique())
metformin ['No' 'Steady' 'Up' 'Down'] repaglinide ['No' 'Up' 'Steady' 'Down'] nateglinide ['No' 'Steady' 'Down' 'Up'] chlorpropamide ['No' 'Steady' 'Down' 'Up'] glimepiride ['No' 'Steady' 'Down' 'Up'] acetohexamide ['No' 'Steady'] glipizide ['No' 'Steady' 'Up' 'Down'] glyburide ['No' 'Steady' 'Up' 'Down'] tolbutamide ['No' 'Steady'] pioglitazone ['No' 'Steady' 'Up' 'Down'] rosiglitazone ['No' 'Steady' 'Up' 'Down'] acarbose ['No' 'Steady' 'Up' 'Down'] miglitol ['No' 'Steady' 'Down' 'Up'] troglitazone ['No' 'Steady'] tolazamide ['No' 'Steady' 'Up'] examide ['No'] citoglipton ['No'] insulin ['Up' 'No' 'Steady' 'Down'] glyburide-metformin ['No' 'Steady' 'Down' 'Up'] glipizide-metformin ['No' 'Steady'] glimepiride-pioglitazone ['No' 'Steady'] metformin-rosiglitazone ['No' 'Steady'] metformin-pioglitazone ['No' 'Steady']
# Assuming 'drug' is a list containing the drug names
for i in drug:
print(i)
sns.countplot(x=i, data=df)
plt.show()
metformin
repaglinide
nateglinide
chlorpropamide
glimepiride
acetohexamide
glipizide
glyburide
tolbutamide
pioglitazone
rosiglitazone
acarbose
miglitol
troglitazone
tolazamide
examide
citoglipton
insulin
glyburide-metformin
glipizide-metformin
glimepiride-pioglitazone
metformin-rosiglitazone
metformin-pioglitazone
variable (drugs named citoglipton and examide), all records have the same value.So essentially these cannot provide any interpretive or discriminatory information for predicting readmission so we decided to drop these two variables
df['citoglipton'].value_counts()
citoglipton No 97805 Name: count, dtype: int64
df['examide'].value_counts()
examide No 97805 Name: count, dtype: int64
df = df.drop(['citoglipton','examide'],axis=1)
l1 = ['metformin','glipizide','insulin','glyburide']
for i in l1:
print(i)
sns.countplot(x=i, data=df)
plt.show()
metformin
glipizide
insulin
glyburide
for i in l1:
print(i)
print((df[i].value_counts(normalize=True)*100))
metformin metformin No 80.004090 Steady 18.355912 Up 1.064363 Down 0.575635 Name: proportion, dtype: float64 glipizide glipizide No 87.316599 Steady 11.348091 Up 0.778079 Down 0.557231 Name: proportion, dtype: float64 insulin insulin No 46.818670 Steady 30.144676 Down 11.972803 Up 11.063852 Name: proportion, dtype: float64 glyburide glyburide No 89.369664 Steady 9.250038 Up 0.815909 Down 0.564388 Name: proportion, dtype: float64
for i in l1:
df[i]=df[i].apply(lambda x:0 if (x== 'No' or x=='Steady') else 1)
df.insulin.value_counts()
insulin 0 75274 1 22531 Name: count, dtype: int64
# drop rest of the medication columns
df = df.drop(['repaglinide',
'nateglinide',
'chlorpropamide',
'glimepiride',
'acetohexamide',
'tolbutamide',
'pioglitazone',
'rosiglitazone',
'acarbose',
'miglitol',
'troglitazone',
'tolazamide',
'glyburide-metformin',
'glipizide-metformin',
'glimepiride-pioglitazone',
'metformin-rosiglitazone',
'metformin-pioglitazone'],axis=1)
df.head()
| race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | ... | max_glu_serum | A1Cresult | metformin | glipizide | glyburide | insulin | change | diabetesMed | readmitted | service_utilization | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Caucasian | 0 | 15 | Emergency | Discharged to home | Emergency | 3 | 59 | 0 | 18 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 2 | AfricanAmerican | 0 | 25 | Emergency | Discharged to home | Emergency | 2 | 11 | 5 | 13 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 |
| 3 | Caucasian | 1 | 35 | Emergency | Discharged to home | Emergency | 2 | 44 | 1 | 16 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 4 | Caucasian | 1 | 45 | Emergency | Discharged to home | Emergency | 1 | 51 | 0 | 8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 5 | Caucasian | 1 | 55 | Emergency | Discharged to home | Referral | 3 | 31 | 6 | 16 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 24 columns
df.race.unique()
array(['Caucasian', 'AfricanAmerican', 'Other', 'Asian', 'Hispanic'],
dtype=object)
fig = plt.figure(figsize=(8,8))
sns.countplot(y=df['race'],hue=df['readmitted'])
<Axes: xlabel='count', ylabel='race'>
race_dummies = pd.get_dummies(df['race'],drop_first=True)
race_dummies.head()
| Asian | Caucasian | Hispanic | Other | |
|---|---|---|---|---|
| 1 | False | True | False | False |
| 2 | False | False | False | False |
| 3 | False | True | False | False |
| 4 | False | True | False | False |
| 5 | False | True | False | False |
df = pd.concat([df,race_dummies],axis=1)
df.drop(['race'],inplace=True,axis=1)
df.head()
| gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | diag_1 | ... | glyburide | insulin | change | diabetesMed | readmitted | service_utilization | Asian | Caucasian | Hispanic | Other | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 15 | Emergency | Discharged to home | Emergency | 3 | 59 | 0 | 18 | 276 | ... | 0 | 1 | 1 | 1 | 0 | 0 | False | True | False | False |
| 2 | 0 | 25 | Emergency | Discharged to home | Emergency | 2 | 11 | 5 | 13 | 648 | ... | 0 | 0 | 0 | 1 | 0 | 3 | False | False | False | False |
| 3 | 1 | 35 | Emergency | Discharged to home | Emergency | 2 | 44 | 1 | 16 | 8 | ... | 0 | 1 | 1 | 1 | 0 | 0 | False | True | False | False |
| 4 | 1 | 45 | Emergency | Discharged to home | Emergency | 1 | 51 | 0 | 8 | 197 | ... | 0 | 0 | 1 | 1 | 0 | 0 | False | True | False | False |
| 5 | 1 | 55 | Emergency | Discharged to home | Referral | 3 | 31 | 6 | 16 | 414 | ... | 0 | 0 | 0 | 1 | 0 | 0 | False | True | False | False |
5 rows × 27 columns
fig = plt.figure(figsize=(8, 8))
sns.countplot(x='gender', hue='readmitted', data=df)
plt.title("Gender of patient vs readmission")
plt.show()
#Encoding
df['admission_type_id'].value_counts()
admission_type_id Emergency 69486 Elective 18344 Not Available 9975 Name: count, dtype: int64
df['admission_type_id'].unique()
array(['Emergency', 'Elective', 'Not Available'], dtype=object)
df['admission_type_id']=df['admission_type_id'].replace({'Emergency':0,'Elective':1,'Not Available':2})
1.discharge_disposition_id
df['discharge_disposition_id'].value_counts()
discharge_disposition_id Discharged to home 71928 Transferred to another facility 20574 Not Available 4598 Left AMA 604 Still patient/referred to this institution 101 Name: count, dtype: int64
df['discharge_disposition_id'].unique()
array(['Discharged to home', 'Transferred to another facility',
'Left AMA', 'Not Available',
'Still patient/referred to this institution'], dtype=object)
df['discharge_disposition_id']=df['discharge_disposition_id'].replace({'Discharged to home':0,'Transferred to another facility':1,'Not Available':2,'Left AMA':3,'Still patient/referred to this institution':4})
df['discharge_disposition_id'].value_counts()
discharge_disposition_id 0 71928 1 20574 2 4598 3 604 4 101 Name: count, dtype: int64
1.admission_source_id
df['admission_source_id'].value_counts()
admission_source_id Emergency 55166 Referral 29881 Not Available 6743 Transferred from hospital 6015 Name: count, dtype: int64
df['admission_source_id']=df['admission_source_id'].replace({'Emergency':0,'Referral':1,'Not Available':2,'Transferred from hospital':3})
df['admission_source_id'].value_counts()
admission_source_id 0 55166 1 29881 2 6743 3 6015 Name: count, dtype: int64
df.head()
| gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | diag_1 | ... | glyburide | insulin | change | diabetesMed | readmitted | service_utilization | Asian | Caucasian | Hispanic | Other | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 15 | 0 | 0 | 0 | 3 | 59 | 0 | 18 | 276 | ... | 0 | 1 | 1 | 1 | 0 | 0 | False | True | False | False |
| 2 | 0 | 25 | 0 | 0 | 0 | 2 | 11 | 5 | 13 | 648 | ... | 0 | 0 | 0 | 1 | 0 | 3 | False | False | False | False |
| 3 | 1 | 35 | 0 | 0 | 0 | 2 | 44 | 1 | 16 | 8 | ... | 0 | 1 | 1 | 1 | 0 | 0 | False | True | False | False |
| 4 | 1 | 45 | 0 | 0 | 0 | 1 | 51 | 0 | 8 | 197 | ... | 0 | 0 | 1 | 1 | 0 | 0 | False | True | False | False |
| 5 | 1 | 55 | 0 | 0 | 1 | 3 | 31 | 6 | 16 | 414 | ... | 0 | 0 | 0 | 1 | 0 | 0 | False | True | False | False |
5 rows × 27 columns
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 97805 entries, 1 to 101765 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gender 97805 non-null int64 1 age 97805 non-null int64 2 admission_type_id 97805 non-null int64 3 discharge_disposition_id 97805 non-null int64 4 admission_source_id 97805 non-null int64 5 time_in_hospital 97805 non-null int64 6 num_lab_procedures 97805 non-null int64 7 num_procedures 97805 non-null int64 8 num_medications 97805 non-null int64 9 diag_1 97805 non-null object 10 diag_2 97805 non-null object 11 diag_3 97805 non-null object 12 number_diagnoses 97805 non-null int64 13 max_glu_serum 97805 non-null int64 14 A1Cresult 97805 non-null int64 15 metformin 97805 non-null int64 16 glipizide 97805 non-null int64 17 glyburide 97805 non-null int64 18 insulin 97805 non-null int64 19 change 97805 non-null int64 20 diabetesMed 97805 non-null int64 21 readmitted 97805 non-null int64 22 service_utilization 97805 non-null int64 23 Asian 97805 non-null bool 24 Caucasian 97805 non-null bool 25 Hispanic 97805 non-null bool 26 Other 97805 non-null bool dtypes: bool(4), int64(20), object(3) memory usage: 18.3+ MB
- diag_1,diag_2,diag_3
- Diagnosis 1 (Nominal) The primary diagnosis(coded as first three digits of ICD9),848 distint values.
- Diagnosis 2 (Nominal) The secondary diagnosis(coded as first three digits of ICD9),923 distint values.
- Diagnosis 1 (Nominal) Additional secondary diagnosis(coded as first three digits of ICD9),954 distint values.
- Diabetes:0
- Circulatort:1
- Respiratory:2
- Digestive:3
- Injury:4
- Musculoskelet:5
- Genitourinory:6
- Neoplasms:7
- Others:8
df.loc[df['diag_1'].str.contains('V',na=False),['diag_1']]=8 #others
df.loc[df['diag_1'].str.contains('E',na=False),['diag_1']]=8
df.loc[df['diag_2'].str.contains('V',na=False),['diag_2']]=8
df.loc[df['diag_2'].str.contains('E',na=False),['diag_2']]=8
df.loc[df['diag_3'].str.contains('V',na=False),['diag_3']]=8
df.loc[df['diag_3'].str.contains('E',na=False),['diag_3']]=8
df['diag_1']=df['diag_1'].astype('float')
df['diag_2']=df['diag_2'].astype('float')
df['diag_3']=df['diag_3'].astype('float')
- Diagnosis:- stands for a primary diagnosis with possible values
- Circulatory:- for ICD9:390-459,785,
- Digestive:- for ICD9:520-579,787,
- Genitourinary:- ICD9:580-629,788,
- Diabetes:- ICD9:250,xx,
- injury:- ICD9:800-999,
- musculoskeletal:- ICD9:710-739
- neoplasms:- ICD9:140-239
- respiratory:- ICD9:460-519,786
- other:- otherwise
for j in ['diag_1', 'diag_2', 'diag_3']:
for i in range(len(df)):
value = df.iloc[i][j]
if np.floor(value) == 250:
df.iloc[i, df.columns.get_loc(j)] = 0 # Diabetes
elif (value >= 390 and value < 460) or np.floor(value) == 785:
df.iloc[i, df.columns.get_loc(j)] = 1 # Circulatory
elif (value >= 460 and value < 520) or np.floor(value) == 786:
df.iloc[i, df.columns.get_loc(j)] = 2 # Respiratory
elif (value >= 520 and value < 580) or np.floor(value) == 787:
df.iloc[i, df.columns.get_loc(j)] = 3 # Digestive
elif value >= 800 and value < 1000:
df.iloc[i, df.columns.get_loc(j)] = 4 # Injury
elif value >= 710 and value < 740:
df.iloc[i, df.columns.get_loc(j)] = 5 # Musculoskeletal
elif (value >= 580 and value < 630) or np.floor(value) == 788:
df.iloc[i, df.columns.get_loc(j)] = 6 # Genitourinary
elif value >= 140 and value < 240:
df.iloc[i, df.columns.get_loc(j)] = 7 # Neoplasms
else:
df.iloc[i, df.columns.get_loc(j)] = 8 # Others
for i in ['diag_1','diag_2','diag_3']:
print(df[i].unique())
[8. 7. 1. 0. 2. 4. 5. 3. 6.] [0. 7. 1. 2. 8. 4. 5. 6. 3.] [8. 1. 0. 2. 4. 7. 6. 5. 3.]
df['diag_1']=df['diag_1'].astype('int')
df['diag_2']=df['diag_2'].astype('int')
df['diag_3']=df['diag_3'].astype('int')
df.head()
| gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | diag_1 | ... | glyburide | insulin | change | diabetesMed | readmitted | service_utilization | Asian | Caucasian | Hispanic | Other | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 15 | 0 | 0 | 0 | 3 | 59 | 0 | 18 | 8 | ... | 0 | 1 | 1 | 1 | 0 | 0 | False | True | False | False |
| 2 | 0 | 25 | 0 | 0 | 0 | 2 | 11 | 5 | 13 | 8 | ... | 0 | 0 | 0 | 1 | 0 | 3 | False | False | False | False |
| 3 | 1 | 35 | 0 | 0 | 0 | 2 | 44 | 1 | 16 | 8 | ... | 0 | 1 | 1 | 1 | 0 | 0 | False | True | False | False |
| 4 | 1 | 45 | 0 | 0 | 0 | 1 | 51 | 0 | 8 | 7 | ... | 0 | 0 | 1 | 1 | 0 | 0 | False | True | False | False |
| 5 | 1 | 55 | 0 | 0 | 1 | 3 | 31 | 6 | 16 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | False | True | False | False |
5 rows × 27 columns
df.dtypes
gender int64 age int64 admission_type_id int64 discharge_disposition_id int64 admission_source_id int64 time_in_hospital int64 num_lab_procedures int64 num_procedures int64 num_medications int64 diag_1 int32 diag_2 int32 diag_3 int32 number_diagnoses int64 max_glu_serum int64 A1Cresult int64 metformin int64 glipizide int64 glyburide int64 insulin int64 change int64 diabetesMed int64 readmitted int64 service_utilization int64 Asian bool Caucasian bool Hispanic bool Other bool dtype: object
VISUALISATION¶
Univariate Analysis¶
gender¶
import plotly.express as px
# Assuming 'gender_counts' is already calculated
gender_counts = df['gender'].value_counts()
# Creating the pie chart in Plotly
fig = px.pie(values=gender_counts, names=gender_counts.index, title='Gender Distribution',
color_discrete_sequence=px.colors.sequential.RdBu, hole=0)
# Customizing the chart for better visualization
fig.update_traces(textposition='inside', textinfo='percent+label',
marker=dict(line=dict(color='#000000', width=15)))
# Adjusting the layout
fig.update_layout(title_text='Gender Distribution', title_x=0.5)
# Showing the plot
fig.show()
age¶
age_counts = df['age'].value_counts()
# Plotting
plt.figure(figsize=(5, 3)) # Adjust the figure size if needed
age_counts.plot(kind='bar', color='skyblue') # Create bar plot
plt.title('Distribution of Age')
plt.xlabel('age')
plt.ylabel('Count')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
num_lab_procedures¶
plt.figure(figsize=(5,3))
sns.distplot(df['num_lab_procedures'],color='red')
<Axes: xlabel='num_lab_procedures', ylabel='Density'>
num_procedures¶
import plotly.express as px
# Plotting with Plotly
fig = px.histogram(
df,
x='num_procedures',
color_discrete_sequence=['red'], # Set the color to red
nbins=50 # You can adjust the number of bins if needed
)
# Adjust the figure size
fig.update_layout(
width=500, # Width in pixels
height=300, # Height in pixels
title='Distribution of num_procedures',
xaxis_title='num_procedures',
yaxis_title='Count'
)
fig.show()
readmitted¶
plt.figure(figsize=(5, 3))
df['readmitted'].value_counts().plot(kind='bar')
plt.show()
A1Cresult¶
plt.figure(figsize=(5,3))
sns.countplot(x='A1Cresult', data=df, color='red')
plt.show()
change¶
plt.figure(figsize=(5,3))
df['change'].value_counts().plot(kind='bar')
plt.show()
# Iterate through each element in 'drug'
for i in drug:
# Check if 'i' is a valid column name in the DataFrame
if i in df.columns:
# Plot the count of each unique value in the column
sns.countplot(x=df[i]) # Update this line to select the column correctly
plt.title(f'Count of {i}')
plt.xlabel(i)
plt.ylabel('Count')
plt.show()
else:
print(f"'{i}' is not a valid column name in the DataFrame.")
'repaglinide' is not a valid column name in the DataFrame. 'nateglinide' is not a valid column name in the DataFrame. 'chlorpropamide' is not a valid column name in the DataFrame. 'glimepiride' is not a valid column name in the DataFrame. 'acetohexamide' is not a valid column name in the DataFrame.
'tolbutamide' is not a valid column name in the DataFrame. 'pioglitazone' is not a valid column name in the DataFrame. 'rosiglitazone' is not a valid column name in the DataFrame. 'acarbose' is not a valid column name in the DataFrame. 'miglitol' is not a valid column name in the DataFrame. 'troglitazone' is not a valid column name in the DataFrame. 'tolazamide' is not a valid column name in the DataFrame. 'examide' is not a valid column name in the DataFrame. 'citoglipton' is not a valid column name in the DataFrame.
'glyburide-metformin' is not a valid column name in the DataFrame. 'glipizide-metformin' is not a valid column name in the DataFrame. 'glimepiride-pioglitazone' is not a valid column name in the DataFrame. 'metformin-rosiglitazone' is not a valid column name in the DataFrame. 'metformin-pioglitazone' is not a valid column name in the DataFrame.
DiabetesMed¶
import plotly.express as px
# Get the counts of each unique value in the 'diabetesMed' column
med_counts = df['diabetesMed'].value_counts()
# Plotting a pie chart using Plotly
fig = px.pie(values=med_counts, names=med_counts.index,
title='diabetesMed Distribution',
labels={'names':'diabetesMed'},
hole=0) # Adjust 'hole' to create a donut chart if desired
# Show the plot
fig.show()
readmitted¶
readmit_counts = df['readmitted'].value_counts()
# Plotting a pie chart
plt.figure(figsize=(5,3))
plt.pie(readmit_counts, labels=readmit_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Readmission Distribution')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
discharge_disposition_id¶
plt.figure(figsize=(5,3))
df['discharge_disposition_id'].value_counts().plot(kind='bar')
plt.show()
Bivariate Analysis¶
import seaborn as sns
pd.crosstab(df['age'],df['readmitted'])
#0=not readmitted,1=readmitted
| readmitted | 0 | 1 |
|---|---|---|
| age | ||
| 5 | 62 | 1 |
| 15 | 437 | 31 |
| 25 | 1289 | 217 |
| 35 | 3195 | 410 |
| 45 | 8380 | 1015 |
| 55 | 15188 | 1659 |
| 65 | 19380 | 2481 |
| 75 | 22114 | 3040 |
| 85 | 14278 | 2047 |
| 95 | 2276 | 305 |
pd.crosstab(df['diabetesMed'],df['readmitted'])
| readmitted | 0 | 1 |
|---|---|---|
| diabetesMed | ||
| 0 | 20086 | 2222 |
| 1 | 66513 | 8984 |
pd.crosstab(df['A1Cresult'],df['diabetesMed'])
| diabetesMed | 0 | 1 |
|---|---|---|
| A1Cresult | ||
| 0 | 19599 | 61794 |
| 1 | 1225 | 3642 |
| 2 | 1484 | 10061 |
pd.crosstab(df['A1Cresult'],df['readmitted'])
| readmitted | 0 | 1 |
|---|---|---|
| A1Cresult | ||
| 0 | 71839 | 9554 |
| 1 | 4388 | 479 |
| 2 | 10372 | 1173 |
pd.crosstab(df['A1Cresult'],df['change'])
| change | 0 | 1 |
|---|---|---|
| A1Cresult | ||
| 0 | 45169 | 36224 |
| 1 | 2716 | 2151 |
| 2 | 4461 | 7084 |
pd.crosstab(df['max_glu_serum'],df['readmitted'])
| readmitted | 0 | 1 |
|---|---|---|
| max_glu_serum | ||
| 0 | 82190 | 10551 |
| 1 | 2212 | 292 |
| 2 | 2197 | 363 |
pd.crosstab(df['max_glu_serum'],df['A1Cresult'])
| A1Cresult | 0 | 1 | 2 |
|---|---|---|---|
| max_glu_serum | |||
| 0 | 76619 | 4810 | 11312 |
| 1 | 2407 | 38 | 59 |
| 2 | 2367 | 19 | 174 |
pd.crosstab(df['insulin'],df['readmitted'])
| readmitted | 0 | 1 |
|---|---|---|
| insulin | ||
| 0 | 67185 | 8089 |
| 1 | 19414 | 3117 |
- Insulin vs Readmission
# Plotting
plt.figure(figsize=(5,3)) # Adjust the figure size if needed
sns.countplot(data=df, x='insulin', hue='readmitted')
plt.title('Readmission vs Insulin')
plt.xlabel('Insulin')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right') # Add legend for readmission
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
- change vs readmitted
plt.figure(figsize=(5,3))
sns.countplot(x='change', hue='readmitted', data=df)
plt.xlabel('Change')
plt.ylabel('Count')
plt.title('Count of Readmissions by Change')
plt.show()
- diabetesMed vs readmitted
plt.figure(figsize=(5,3))
sns.countplot(x='diabetesMed', hue='readmitted', data=df)
plt.xlabel('Diabetes Medication')
plt.ylabel('Count')
plt.title('Count of Readmissions by Diabetes Medication')
plt.legend(title='Readmitted', loc='upper right')
plt.show()
- Time_in_hospital vs readmitted
import plotly.express as px
# Plotting a count plot using Plotly
fig = px.histogram(df,
x='time_in_hospital',
color='readmitted',
title='Readmission vs Time in Hospital',
labels={'time_in_hospital':'Time in Hospital', 'count':'Count'},
barmode='group', # Ensures that bars for different categories are shown side by side
category_orders={"readmitted": ["No", "Yes"]} # Optional: specify order if needed
)
# Customize layout to adjust the size and position of elements
fig.update_layout(
width=500, # Width in pixels
height=300, # Height in pixels
legend_title_text='Readmission', # Title for the legend
legend=dict(
x=1, # Position legend on the right
y=1, # Position legend at the top
traceorder="normal"
)
)
# Show the plot
fig.show()
- max_glu_serum vs readmission
# Plotting
plt.figure(figsize=(5,3)) # Adjust the figure size if needed
sns.countplot(data=df, x='max_glu_serum', hue='readmitted')
plt.title('Readmission vs max_glu_serum')
plt.xlabel('max_glu_serum')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right') # Add legend for readmission
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
- A1Cresult vs readmission
# Plotting
plt.figure(figsize=(5,3)) # Adjust the figure size if needed
sns.countplot(data=df, x='A1Cresult', hue='readmitted')
plt.title('Readmission vs A1Cresult')
plt.xlabel('A1Cresult')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right') # Add legend for readmission
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
- gender vs readmission
# Plotting
plt.figure(figsize=(5,3)) # Adjust the figure size if needed
sns.countplot(data=df, x='gender', hue='readmitted')
plt.title('Readmission vs gender')
plt.xlabel('gender')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right') # Add legend for readmission
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
- age vs readmission
import plotly.express as px
# Plotting a count plot using Plotly
fig = px.histogram(df,
x='age',
color='readmitted',
title='Readmission vs Age',
labels={'age':'Age', 'count':'Count'},
barmode='group' # Ensures that bars for different categories are shown side by side
)
# Customize layout to adjust the size and position of elements
fig.update_layout(
width=500, # Width in pixels
height=300, # Height in pixels
legend_title_text='Readmission', # Title for the legend
legend=dict(
x=1, # Position legend on the right
y=1, # Position legend at the top
traceorder="normal"
)
)
# Show the plot
fig.show()
- metformin vs readmission
# Plotting
plt.figure(figsize=(5,3)) # Adjust the figure size if needed
sns.countplot(data=df, x='metformin', hue='readmitted')
plt.title('Readmission vs metformin')
plt.xlabel('metformin')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right') # Add legend for readmission
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
- glipizide vs readmission
# Plotting
plt.figure(figsize=(5,3)) # Adjust the figure size if needed
sns.countplot(data=df, x='glipizide', hue='readmitted')
plt.title('Readmission vs glipizide')
plt.xlabel('glipizide')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right') # Add legend for readmission
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
- glyburide vs readmission
# Plotting
plt.figure(figsize=(5,3)) # Adjust the figure size if needed
sns.countplot(data=df, x='glyburide', hue='readmitted')
plt.title('Readmission vs glyburide')
plt.xlabel('glyburide')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right') # Add legend for readmission
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
- num_medications vs readmission
import plotly.express as px
# Plotting a count plot using Plotly
fig = px.histogram(df,
x='num_medications',
color='readmitted',
title='Readmission vs num_medications',
labels={'num_medications':'Number of Medications', 'count':'Count'},
barmode='group' # Ensures that bars for different categories are shown side by side
)
# Customize layout to adjust the size and position of elements
fig.update_layout(
width=800, # Width in pixels
height=500, # Height in pixels
legend_title_text='Readmission', # Title for the legend
legend=dict(
x=1, # Position legend on the right
y=1, # Position legend at the top
traceorder="normal"
)
)
# Show the plot
fig.show()
- num_lab_procedures vs readmission
import plotly.express as px
# Plotting a grouped bar chart using Plotly
fig = px.histogram(df,
x='num_lab_procedures',
color='readmitted',
title='Readmission vs num_lab_procedures',
labels={'num_lab_procedures':'Number of Lab Procedures', 'count':'Count'},
barmode='group' # Ensures that bars for different categories are shown side by side
)
# Customize layout to adjust the size and position of elements
fig.update_layout(
width=1000, # Width in pixels
height=800, # Height in pixels
legend_title_text='Readmission', # Title for the legend
legend=dict(
x=1, # Position legend on the right
y=1, # Position legend at the top
traceorder="normal"
)
)
# Show the plot
fig.show()
- num_procedures
# Plotting
plt.figure(figsize=(5,3)) # Adjust the figure size if needed
sns.countplot(data=df, x='num_procedures', hue='readmitted')
plt.title('Readmission vs num_procedures')
plt.xlabel('num_procedures')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right') # Add legend for readmission
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
- time_in_hospital vs readmission
# Plotting
plt.figure(figsize=(8, 6)) # Adjust the figure size if needed
sns.countplot(data=df, x='time_in_hospital', hue='readmitted')
plt.title('Readmission vs time_in_hospital')
plt.xlabel('time_in_hospital')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right') # Add legend for readmission
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
df.dtypes
gender int64 age int64 admission_type_id int64 discharge_disposition_id int64 admission_source_id int64 time_in_hospital int64 num_lab_procedures int64 num_procedures int64 num_medications int64 diag_1 int32 diag_2 int32 diag_3 int32 number_diagnoses int64 max_glu_serum int64 A1Cresult int64 metformin int64 glipizide int64 glyburide int64 insulin int64 change int64 diabetesMed int64 readmitted int64 service_utilization int64 Asian bool Caucasian bool Hispanic bool Other bool dtype: object
Relation with Target variable¶
import pandas as pd
import numpy as np
print("According to the data there are total 50 columns")
# Select categorical columns
df_cat = df.select_dtypes(include='object')
# Select numerical columns
df_num = df.select_dtypes(include=np.number)
print('Total categorical columns : ', len(df_cat.columns))
print('\n')
print('Categorical column names : ', df_cat.columns)
print('\n')
print('Total numerical columns : ', len(df_num.columns))
print('\n')
print('Numerical column names : ', df_num.columns)
According to the data there are total 50 columns
Total categorical columns : 0
Categorical column names : Index([], dtype='object')
Total numerical columns : 23
Numerical column names : Index(['gender', 'age', 'admission_type_id', 'discharge_disposition_id',
'admission_source_id', 'time_in_hospital', 'num_lab_procedures',
'num_procedures', 'num_medications', 'diag_1', 'diag_2', 'diag_3',
'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin',
'glipizide', 'glyburide', 'insulin', 'change', 'diabetesMed',
'readmitted', 'service_utilization'],
dtype='object')
# Calculate correlation matrix
correlation_matrix = df.corr()
# Calculate correlation of each feature with the target variable
correlation_with_target = correlation_matrix['readmitted'].sort_values(ascending=False)
# Print correlation of each feature with the target variable
print("Correlation with target variable 'readmitted':")
print(correlation_with_target)
# Visualize correlation matrix as a heatmap
plt.figure(figsize=(22, 18))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
Correlation with target variable 'readmitted': readmitted 1.000000 service_utilization 0.126385 discharge_disposition_id 0.064165 number_diagnoses 0.050132 time_in_hospital 0.045807 insulin 0.040826 num_medications 0.039301 diabetesMed 0.025548 num_lab_procedures 0.023401 age 0.018844 change 0.018185 max_glu_serum 0.013159 diag_3 0.009672 glipizide 0.009611 diag_2 0.008928 Caucasian 0.002037 gender -0.002070 Asian -0.002084 Hispanic -0.004049 glyburide -0.005139 diag_1 -0.005402 Other -0.006676 metformin -0.007274 num_procedures -0.011721 admission_type_id -0.012577 A1Cresult -0.018297 admission_source_id -0.019333 Name: readmitted, dtype: float64
Service Utilization (0.126385): There is a moderate positive correlation between service utilization and readmission. Patients who utilize more services during their stay are slightly more likely to be readmitted.
Discharge Disposition ID (0.064165): There is a weak positive correlation between the discharge disposition ID and readmission. Certain discharge dispositions may be associated with higher chances of readmission.
Number of Diagnoses (0.050132): There is a weak positive correlation between the number of diagnoses and readmission. Patients with a higher number of diagnoses tend to have slightly higher readmission rates.
Time in Hospital (0.045807): There is a weak positive correlation between the time spent in the hospital and readmission. Longer hospital stays may be associated with slightly higher readmission rates.
Insulin (0.040826): There is a weak positive correlation between insulin usage and readmission. Patients prescribed insulin may have slightly higher chances of readmission.
Number of Medications (0.039301): There is a weak positive correlation between the number of medications prescribed and readmission. Patients on more medications may have slightly higher readmission rates.
Diabetes Medication (0.025548): There is a weak positive correlation between diabetes medication and readmission. Patients prescribed diabetes medication may have slightly higher chances of readmission.
Number of Lab Procedures (0.023401): There is a weak positive correlation between the number of lab procedures performed and readmission. Patients undergoing more lab procedures may have slightly higher readmission rates.
Age (0.018844): There is a weak positive correlation between age and readmission. Older patients may have slightly higher chances of readmission.
Change (0.018185): There is a weak positive correlation between change in medication and readmission. Changes in medication regimen may be associated with slightly higher readmission rates.
Model building¶
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
df.head()
| gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | diag_1 | ... | glyburide | insulin | change | diabetesMed | readmitted | service_utilization | Asian | Caucasian | Hispanic | Other | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 15 | 0 | 0 | 0 | 3 | 59 | 0 | 18 | 8 | ... | 0 | 1 | 1 | 1 | 0 | 0 | False | True | False | False |
| 2 | 0 | 25 | 0 | 0 | 0 | 2 | 11 | 5 | 13 | 8 | ... | 0 | 0 | 0 | 1 | 0 | 3 | False | False | False | False |
| 3 | 1 | 35 | 0 | 0 | 0 | 2 | 44 | 1 | 16 | 8 | ... | 0 | 1 | 1 | 1 | 0 | 0 | False | True | False | False |
| 4 | 1 | 45 | 0 | 0 | 0 | 1 | 51 | 0 | 8 | 7 | ... | 0 | 0 | 1 | 1 | 0 | 0 | False | True | False | False |
| 5 | 1 | 55 | 0 | 0 | 1 | 3 | 31 | 6 | 16 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | False | True | False | False |
5 rows × 27 columns
X=df.drop('readmitted',axis=1)
y=df.readmitted
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=1)
print('X_train',X_train.shape)
print('y_train',y_train.shape)
print('X_test',X_test.shape)
print('y_test',y_test.shape)
X_train (68463, 26) y_train (68463,) X_test (29342, 26) y_test (29342,)
LogisticRegression¶
# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)
Accuracy: 0.8839888214845614
Confusion Matrix:
[[25912 42]
[ 3362 26]]
Classification Report:
precision recall f1-score support
0 0.89 1.00 0.94 25954
1 0.38 0.01 0.02 3388
accuracy 0.88 29342
macro avg 0.63 0.50 0.48 29342
weighted avg 0.83 0.88 0.83 29342
DecisionTree¶
# Initialize and train the Logistic Regression model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)
Accuracy: 0.7860745688773771
Confusion Matrix:
[[22456 3498]
[ 2779 609]]
Classification Report:
precision recall f1-score support
0 0.89 0.87 0.88 25954
1 0.15 0.18 0.16 3388
accuracy 0.79 29342
macro avg 0.52 0.52 0.52 29342
weighted avg 0.80 0.79 0.79 29342
RandomForest¶
# Initialize and train the Logistic Regression model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)
Accuracy: 0.8846704382795992
Confusion Matrix:
[[25947 7]
[ 3377 11]]
Classification Report:
precision recall f1-score support
0 0.88 1.00 0.94 25954
1 0.61 0.00 0.01 3388
accuracy 0.88 29342
macro avg 0.75 0.50 0.47 29342
weighted avg 0.85 0.88 0.83 29342
Findings¶
If the patient has the following characteristics he has a high probability of being readmitted
- High preceding year visit
- If the patient is discharge to another medical facility or discharged to home with health services
- High number of diagnoses
- If the patient is given diabetes medicines
- If the primary diagnosed disease was of circulatory system
- If metformin and/or insulin is not being given or the dosage is low
- If secondary disgnosis was coming to be Diabetes
- If A1C test was not performed
CHALLENGES ENCOUNTERED¶
- The major challenge was to acquire sufficient domain knowledge of the medical world
- Huge amount of Data
Business Ideas¶
Medical facilities can take precautionary measures with patients during their initial admission by making A1C and Maximum Glucose Serum test compulsory and providing the treatment accordingly
Providing extra attention and care to high-risk patients
A follow-up with the discharged patients should be done to keep a track of their health and to counsel them from time-to-time
High-risk patients current medicines regime should be re-evaluated and the most effective medicines should be considered
Most effective Medications:- Metformin,Glipizide,Insulin
The annual plan,financials and infrastructure/inventory should be planned accordingly by taking into account the predicted readmissions.
Conclusions¶
1.Patients with higher service utilization, longer time in hospital, and higher number of medications are more likely to be readmitted.
2.Medications such as Metformin, Glipizide, and Insulin show significant association with reduced readmissions, highlighting their importance in diabetes management.
3.Demographic factors such as age, gender, and race do not have a strong correlation with readmissions, indicating that clinical factors play a more significant role.
4.Predictive models, such as logistic regression and decision trees, can accurately predict hospital readmissions for diabetic patients based on clinical and demographic variables.
5.Feature engineering techniques, such as combining service utilization variables, can improve the predictive performance of models.
6.Preserving outliers in the dataset is crucial as they often represent extreme cases with significant predictive power.
7.Addressing class imbalance in the target variable improves the generalization of predictive models and reduces bias.
8.Model regularization techniques help prevent overfitting and improve the robustness of predictive models.
9.Annual planning and resource allocation in healthcare facilities should consider predicted readmissions to optimize care delivery and reduce costs.
10.Implementing targeted interventions, such as compulsory A1C and Maximum Glucose Serum tests for high-risk patients, can help prevent readmissions and improve patient outcomes.
- These conclusions and business ideas can serve as valuable insights for healthcare providers, policymakers, and stakeholders looking to address the challenge of hospital readmissions for diabetic patients and improve overall healthcare outcomes.